Analysis of Belarus Used Car Market


Why Used Car Market in Belarus?

For the last year and a half, the world has been dealing with an unprecedented event: the coronavirus pandemic. While it affected many areas of people’s lives, one effect that received little attention was its impact on the global supply chain. Early in the pandemic, people stocked up on goods, fearing that some of the most commonly available consumer items would become scarce. Hygienic wipes, for example, were among the scarcest items for many months, and most large retail chains (CVS, Target, Safeway) limited customers to one pack per purchase.

While the world recovers from this once-in-a-hundred-years phenomenon, the car market has also been hit by the sudden changes. In many countries it is very hard to find new cars (Isidore, 2021), and as a result more and more people are turning to the used car market. For this reason, Team PatternSix decided to take a deep dive into the used car market and help potential buyers and sellers get the best prices for the specific features they are looking for.

As prospective data scientists, Team PatternSix wanted to take on a current issue and explain the findings using up-to-date data analysis and data visualization techniques. PatternSix found the Belarus car market data particularly interesting, both because the data set has a sufficient number of multi-level variables and because the team saw a story to tell to the common consumer.

Prior Research

Used car prices have been studied extensively. For example, a simple search on Google Scholar returns over a million articles, some dating back to the 1960s. Articles come from all over the world, from countries ranging from Turkey to Australia.

One study that inspired Team PatternSix examined the impact of digital disruption (Ben Ellencweig, Sam Ezratty, Dan Fleming, and Itai Miller, 2019). The most interesting takeaway from this research was that the used car market is less sensitive to macro-economic shocks than the new car market. Given that the world is going through a global catastrophe, this was an interesting point. The exhibit displayed in the analysis suggests that used car sales were affected less by crises such as the dot-com bubble or the rising interest rates at the beginning of the 1990s.

Considering that the Belarus used car data set was gathered from the web, this study was an important reference for the team’s research.

Data Preprocessing

Data Import and Cleaning

Renaming features

PatternSix renamed ‘price_usd’ to ‘price’.

Engine capacity had 10 null values; PatternSix dropped the rows containing them.
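
The two preprocessing steps described above can be sketched in base R as follows. This is a minimal illustration on a made-up four-row data frame (the values and the name `raw` are assumptions, not the team's actual import code); only the column names `price_usd`, `price` and `engine_capacity` come from the report.

```r
# Toy stand-in for the raw catalog; the real file has many more columns.
raw <- data.frame(
  price_usd       = c(10900, 5000, 2800, 9999),
  engine_capacity = c(2.5, NA, 1.6, 3.0)
)

# Rename price_usd to price.
names(raw)[names(raw) == "price_usd"] <- "price"

# Keep only the rows where engine_capacity is present.
df_clean <- raw[!is.na(raw$engine_capacity), ]

nrow(df_clean)
names(df_clean)
```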

outliers = boxplot(df$price, plot=FALSE)$out
length(outliers)
[1] 1750
# df2<- df[which(df$price %in% outliers),]

There are 1750 price outliers in this data set. They were not removed, since they reflect the actual situation in the used car market: they represent relatively new cars with higher prices.
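
For readers unfamiliar with how `boxplot(...)$out` decides what counts as an outlier, the sketch below reproduces the rule by hand: points lying more than 1.5 times the hinge spread beyond the lower or upper hinge are flagged. The simulated `x` is an assumption standing in for the price column.

```r
set.seed(1)
# Simulated right-skewed "prices"; not the real data.
x <- rlnorm(200, meanlog = 8, sdlog = 0.8)

# Hinges (Tukey's five-number summary) and the 1.5 * spread fences.
h     <- fivenum(x)
iqr   <- h[4] - h[2]
lower <- h[2] - 1.5 * iqr
upper <- h[4] + 1.5 * iqr

manual_out <- x[x < lower | x > upper]

# The same points are returned by boxplot.stats(), which boxplot() uses internally.
length(manual_out)
all.equal(sort(manual_out), sort(boxplot.stats(x)$out))
```

With a right-skewed variable such as price, nearly all flagged points sit above the upper fence, which is consistent with the report's reading that they are expensive, relatively new cars.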

Summary of Dataset

• The Data set used for the project is “Belarus-Used-cars-catalog” taken from the public data source Kaggle (An online community of data scientists and machine learning practitioners).

Link: https://www.kaggle.com/lepchenkov/usedcarscatalog?select=cars.csv

• The Data set contains information about the Belarus (Eastern Europe) used car market from the year 2019.

• The total number of variables in the data set is 19.

• The total number of observations in the data set is 38521.

• The Data set helps the team explore the used car market in Belarus and build a model that relates car prices to car features, so that the price of a used car can be predicted from given parameters (both numerical and categorical).

• From the Data set, the team mainly focuses on the following features for the Exploratory Data Analysis:

  • Color
  • Transmission
  • Odometer value
  • Year of Production
  • Body type
  • Number of Photos
  • Duration of days

Limitations of Dataset:

  1. The “Belarus-Used-cars-catalog” data set is limited to Belarus, so it does not allow PatternSix to draw conclusions about used car markets in other countries.

  2. There is no ‘electric’ car category as the data set is limited to gasoline and diesel.

  3. The data set could have contained more features, which Team PatternSix could have used in the Exploratory Data Analysis to compare multiple features in more detail.

SMART Questions

The following are the SMART questions which PatternSix came up with and followed.

Specific: Is it possible to build a model that relates car prices to different numerical and categorical factors, and to use that model to predict car prices?

Measurable: Is it possible to measure metrics such as r-square, MAE, MSE and RMSE with the data set categories?

Achievable: Based on the team’s preliminary analysis, is it possible to find a pattern between the target variable (car price) and the independent variables?

Relevant: Can the research help the sellers and buyers in the used car market to make an informed decision about the price of the vehicle?

Time Oriented: Will the initial analysis be completed by November 2nd, together with the presentation?
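
The metrics named in the Measurable question can be computed from any vector of actual and predicted prices. The helper below is a hypothetical sketch (the function name `regression_metrics` and the toy numbers are assumptions, not part of the team's code) showing the standard definitions.

```r
# Standard regression metrics from actual and predicted values.
regression_metrics <- function(actual, predicted) {
  resid <- actual - predicted
  mse   <- mean(resid^2)
  c(
    r_squared = 1 - sum(resid^2) / sum((actual - mean(actual))^2),
    mae       = mean(abs(resid)),
    mse       = mse,
    rmse      = sqrt(mse)
  )
}

# Toy example with made-up values.
actual    <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 9)
m <- regression_metrics(actual, predicted)
m
```

For a fitted `lm` object, the same function could be called with the observed response and `fitted(model)`.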

Exploratory Data Analysis

# summary(cars_numerical)
library(fBasics) 
options(width = 300 )
basicStats(cars_numerical)
            odometer_value year_produced engine_capacity    price number_of_photos up_counter
nobs              3.85e+04      3.85e+04        3.85e+04 3.85e+04         3.85e+04   3.85e+04
NAs               0.00e+00      0.00e+00        0.00e+00 0.00e+00         0.00e+00   0.00e+00
Minimum           0.00e+00      1.94e+03        2.00e-01 1.00e+00         1.00e+00   1.00e+00
Maximum           1.00e+06      2.02e+03        8.00e+00 5.00e+04         8.60e+01   1.86e+03
1. Quartile       1.58e+05      2.00e+03        1.60e+00 2.10e+03         5.00e+00   2.00e+00
3. Quartile       3.25e+05      2.01e+03        2.30e+00 8.95e+03         1.20e+01   1.60e+01
Mean              2.49e+05      2.00e+03        2.06e+00 6.64e+03         9.65e+00   1.63e+01
Median            2.50e+05      2.00e+03        2.00e+00 4.80e+03         8.00e+00   5.00e+00
Sum               9.59e+09      7.72e+07        7.92e+04 2.56e+08         3.72e+05   6.28e+05
SE Mean           6.93e+02      4.11e-02        3.42e-03 3.27e+01         3.10e-02   2.21e-01
LCL Mean          2.48e+05      2.00e+03        2.05e+00 6.57e+03         9.59e+00   1.59e+01
UCL Mean          2.50e+05      2.00e+03        2.06e+00 6.70e+03         9.71e+00   1.67e+01
Variance          1.85e+10      6.50e+01        4.50e-01 4.13e+07         3.71e+01   1.87e+03
Stdev             1.36e+05      8.06e+00        6.71e-01 6.43e+03         6.09e+00   4.33e+01
Skewness          1.17e+00     -3.93e-01        2.05e+00 2.24e+00         1.60e+00   1.33e+01
Kurtosis          4.90e+00      6.54e-01        6.37e+00 7.28e+00         4.96e+00   3.08e+02

The table above gives the basic statistical measures of the numeric data. There are six numerical variables in the data set, the most important of which is the used car’s price, with mean=6640 and standard deviation (sd)=6430. The odometer_value has mean=249000 and sd=136000, year_produced has mean=2000 (as rounded in the table) and sd=8.06, and engine_capacity has mean=2.06 and sd=0.67. The absolute skewness is greater than 1 for every variable except year_produced (-0.39), so all variables but year_produced are highly skewed. The kurtosis values are all greater than 0, indicating distributions that are more sharply peaked and heavier-tailed than a normal distribution. More analysis of the variables is shown below.
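
To make the skewness and kurtosis rows easier to interpret, here is a small base R sketch of moment-based skewness and excess kurtosis on simulated data. The helper names are hypothetical and the formulas are the usual standardized-moment definitions; they are meant as an illustration and may differ in small details from what `basicStats` reports.

```r
# Moment-based skewness: mean of cubed standardized values.
moment_skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

# Excess kurtosis: mean of fourth-power standardized values minus 3,
# so that a normal distribution scores about 0.
excess_kurtosis <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^4) - 3
}

set.seed(42)
right_skewed <- rexp(10000)    # long right tail, like price
symmetric    <- rnorm(10000)   # reference normal sample

moment_skewness(right_skewed)
moment_skewness(symmetric)
excess_kurtosis(right_skewed)
excess_kurtosis(symmetric)
```

The right-skewed sample gives a clearly positive skewness and excess kurtosis, while the normal sample stays close to 0 on both.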

Normality tests

This section checks the normality of the numerical variables using Q-Q plots, histograms, and a formal normality test. The most common normality test is the Shapiro-Wilk test; however, R’s implementation only accepts between 3 and 5000 observations, and the Belarus used car data set is larger than that, so the Kolmogorov-Smirnov (K-S) normality test is used instead.
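
The size limit mentioned above can be seen directly in R. The sketch below uses simulated data; testing a random subsample of 5000 is shown only as one possible workaround and is an assumption, not the approach taken in this report.

```r
set.seed(7)
x <- rnorm(6000)   # simulated data, larger than the 5000 limit

# shapiro.test() stops with an error when the sample is too large.
too_big <- tryCatch(
  { shapiro.test(x); FALSE },
  error = function(e) TRUE
)
too_big

# A possible workaround (assumption): test a random subsample of 5000 values.
sub_result <- shapiro.test(sample(x, 5000))
sub_result$p.value
```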

library(ggplot2)
library(gridExtra)
plot1 = ggplot(cars_numerical, aes(sample = price)) + stat_qq(col="#00AFBB") + stat_qq_line() + labs(title = 'Q-Q plot of price') 
plot2 = ggplot(cars_numerical, aes(x = price)) + geom_histogram(fill = "#00AFBB", colour="white", bins=40) + labs(title = 'Histogram of price')

grid.arrange(plot1, plot2, ncol=2, nrow=1)

As the quantile-quantile plot and the histogram show, price is not normally distributed. If PatternSix wants to use price as the dependent variable in a linear regression, it may be necessary to transform it toward a normal distribution first.

plot3 = ggplot(cars_numerical, aes(sample = odometer_value)) + stat_qq(col="#00AFBB") + stat_qq_line() + labs(title = 'Q-Q plot of odometer_value')
plot4 = ggplot(cars_numerical, aes(x = odometer_value)) + geom_histogram(fill = "#00AFBB", colour="white", bins=40) + labs(title = 'Histogram of odometer_value')

plot5 = ggplot(cars_numerical, aes(sample = year_produced)) + stat_qq(col="#00AFBB") + stat_qq_line() + labs(title = 'Q-Q plot of year_produced')
plot6 = ggplot(cars_numerical, aes(x = year_produced)) + geom_histogram(fill = "#00AFBB", colour="white", bins=40) + labs(title = 'Histogram of year_produced')

grid.arrange(plot3, plot4, plot5, plot6, ncol=2, nrow=2)

plot7 = ggplot(cars_numerical, aes(sample = engine_capacity)) + stat_qq(col="#00AFBB") + stat_qq_line() + labs(title = 'Q-Q plot of engine_capacity')
plot8 = ggplot(cars_numerical, aes(x = engine_capacity)) + geom_histogram(fill = "#00AFBB", colour="white", bins=40) + labs(title = 'Histogram of engine_capacity')

plot9 = ggplot(cars_numerical, aes(sample = number_of_photos)) + stat_qq(col="#00AFBB") + stat_qq_line() + labs(title = 'Q-Q plot of number_of_photos')
plot10 = ggplot(cars_numerical, aes(x = number_of_photos)) + geom_histogram(fill = "#00AFBB", colour="white", bins=40) + labs(title = 'Histogram of number_of_photos')

grid.arrange(plot7, plot8, plot9, plot10, ncol=2, nrow=2)

The Q-Q plots and histograms also show evidence of non-normality. The odometer_value, engine_capacity and number_of_photos are right-skewed, while year_produced is left-skewed.

Next, the Kolmogorov-Smirnov normality test is applied to the data. The null hypothesis of this test is that the sample distribution is normal.

ks.test(df$price, 'pnorm', mean=mean(df$price), sd=sd(df$price))

    One-sample Kolmogorov-Smirnov test

data:  df$price
D = 0.2, p-value <2e-16
alternative hypothesis: two-sided
ks.test(df$odometer_value, 'pnorm', mean=mean(df$odometer_value), sd=sd(df$odometer_value))

    One-sample Kolmogorov-Smirnov test

data:  df$odometer_value
D = 0.06, p-value <2e-16
alternative hypothesis: two-sided
ks.test(df$year_produced, 'pnorm', mean=mean(df$year_produced), sd=sd(df$year_produced))

    One-sample Kolmogorov-Smirnov test

data:  df$year_produced
D = 0.06, p-value <2e-16
alternative hypothesis: two-sided
ks.test(df$engine_capacity, 'pnorm', mean=mean(df$engine_capacity), sd=sd(df$engine_capacity))

    One-sample Kolmogorov-Smirnov test

data:  df$engine_capacity
D = 0.2, p-value <2e-16
alternative hypothesis: two-sided
ks.test(df$number_of_photos, 'pnorm', mean=mean(df$number_of_photos), sd=sd(df$number_of_photos))

    One-sample Kolmogorov-Smirnov test

data:  df$number_of_photos
D = 0.1, p-value <2e-16
alternative hypothesis: two-sided

The p-values for all the numeric variables are < 2e-16, which is less than 0.05; therefore it can be concluded that the distributions of all the numeric variables differ significantly from a normal distribution. This agrees with the Q-Q plots and histograms.

The sample size for this data is 38521. Since, by the central limit theorem, sample means are approximately normally distributed for a sample this large, the rest of the analysis is carried out on the original data.
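
The central limit theorem argument can be illustrated with a short simulation. The log-normal "population" below is an assumption chosen only because it is strongly right-skewed like price; the point is that means of moderately large samples are far less skewed than the underlying values.

```r
set.seed(123)
# Strongly right-skewed population (simulated, not the real prices).
population <- rlnorm(100000, meanlog = 8, sdlog = 1)

# Moment-based skewness helper.
skew <- function(x) mean(((x - mean(x)) / sd(x))^3)

# Means of 2000 random samples of size 500 each.
sample_means <- replicate(2000, mean(sample(population, 500)))

skew(population)     # large positive skewness
skew(sample_means)   # much closer to 0
```

This supports running mean-based tests such as the t-test and ANOVA on the untransformed data, although it says nothing about the normality of regression residuals.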

Correlation Plot

library(corrplot)
corrplot(cor(cars_numerical), method = 'number')

Figure 1 shows the correlation between the numerical features.

The team used a correlation plot for checking the correlation between continuous variables. Year of production was highly correlated with price with correlation coefficient(cc)=0.7. Odometer value had a negative correlation with year produced (cc=-0.49) and price (cc=-0.42). Engine capacity also had a positive correlation with price (cc=0.30).

library(ggplot2)
library(dplyr)

df %>% group_by(year_produced) %>% summarize(mean_price_per_year = mean(price, na.rm=TRUE)) %>% ggplot(aes(x=year_produced,y=mean_price_per_year)) +  geom_col(fill = "#00AFBB") + labs(title='Avg Price of Car per Year', x="year produced", y = "mean price per year") + theme(plot.title = element_text(hjust = 0.5))

Figure 2 shows the average price of cars for each production year between 1940 and 2020. The team observed a steady decrease in price as cars get older. However, for cars produced before about 1990 the prices spike, as these cars fall under the classic or vintage category.

The bar plot also showed that vintage cars produced around 1965 are pricier than many newer cars, and that the average price increases steadily for cars produced after around 1985.

df %>% group_by(engine_capacity) %>% summarize(mean_price_per_capacity = mean(price, na.rm=TRUE)) %>% ggplot(aes(x=engine_capacity,y=mean_price_per_capacity)) +  geom_point(color = "#00AFBB") + labs(title='Avg Price of Car for engine capacity', x='Engine Capacity', y='Mean Price') + theme(plot.title = element_text(hjust = 0.5))

Figure 3 shows the average price of the car for each engine capacity. The team observed a positive linear trend between engine capacity and the mean price per capacity.

df %>% group_by(engine_capacity) %>% summarize(mean_price_per_capacity = mean(price, na.rm=TRUE)) ->df4
cor(df4)
                        engine_capacity mean_price_per_capacity
engine_capacity                   1.000                   0.583
mean_price_per_capacity           0.583                   1.000
#corrplot(cor(cars_numerical), method = 'number')

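The observed correlation coefficient is 0.58 (about 0.6). However, Figure 1 showed a correlation coefficient of only 0.3 between price and engine capacity. The two values are not directly comparable: the coefficient here is computed on per-capacity mean prices rather than on individual cars. The gap could also be explained by the outliers found at higher engine capacities.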
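
The difference between a correlation computed on individual rows and one computed on group means can be reproduced with simulated data. Everything in this sketch (the slope, the noise level, the variable names) is an assumption chosen for illustration, not an estimate from the Belarus data.

```r
set.seed(2021)
n <- 5000
toy <- data.frame(
  capacity = sample(seq(1.0, 4.0, by = 0.2), n, replace = TRUE)
)
# Price depends weakly on capacity, with a lot of car-to-car noise.
toy$price <- 2000 * toy$capacity + rnorm(n, sd = 6000)

# Correlation on individual cars.
raw_cor <- cor(toy$capacity, toy$price)

# Correlation on per-capacity mean prices.
group_means <- aggregate(price ~ capacity, data = toy, FUN = mean)
means_cor   <- cor(group_means$capacity, group_means$price)

raw_cor
means_cor
```

Averaging within each capacity removes most of the car-to-car noise, so the correlation of the group means is higher than the correlation on the raw rows even though the underlying relationship is the same.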

df %>% ggplot(aes(x=reorder(body_type,-engine_capacity),y=engine_capacity, fill=body_type))+geom_boxplot() + labs(x='Body Type', y='Engine Capacity')  + ggtitle('Body Type vs Engine Capacity ') + theme(plot.title = element_text(hjust = 0.5))

Figure 4 shows the distribution of engine capacity for each body type using box plots. From this initial analysis, the team observed that the median differs between the groups.

T test

When two samples are drawn and the goal is to test whether their population means are the same, it is appropriate to perform Student’s t-test, or t-test for short. Team PatternSix did not choose the Z-test because the population standard deviation is unknown. With the t-test, the team used the sample standard deviation (s) to estimate the population parameter (σ).
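
By default, R's `t.test` runs the Welch version of the two-sample test, which does not assume equal variances. As a hedged illustration of what that test computes, the sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom by hand on simulated groups (the group sizes, means and spreads are made up) and compares them with `t.test`.

```r
set.seed(10)
# Two simulated groups with different sizes and spreads (not the real data).
a <- rnorm(40,  mean = 23000, sd = 9000)
b <- rnorm(400, mean = 6500,  sd = 6000)

# Per-group variance of the mean, using the sample variances.
va <- var(a) / length(a)
vb <- var(b) / length(b)

# Welch t statistic and Welch-Satterthwaite degrees of freedom.
t_manual  <- (mean(a) - mean(b)) / sqrt(va + vb)
df_manual <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

res <- t.test(a, b)
all.equal(unname(res$statistic), t_manual)
all.equal(unname(res$parameter), df_manual)
```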

Warranty vs Price

PatternSix tested some of the features against price, since price is going to be the dependent variable. The first comparison is between the average prices of cars with and without a warranty. A box plot helps show the relationship between these two.

df %>% ggplot(aes(has_warranty, price, fill=has_warranty)) + geom_boxplot() + ggtitle('Has_Warranty vs Prices ') + theme(plot.title = element_text(hjust = 0.5))

From the graph, the prices appear to differ markedly between cars with and without a warranty.

A t-test was performed to test this observation formally.

summary(df$has_warranty)
False  True 
38072   449 
has = subset(df, has_warranty == "True")
hasnot = subset(df, has_warranty == "False")
# has = subset(df, has_warranty == 1)
# hasnot = subset(df, has_warranty == 0)
t.test(x = has$price, y = hasnot$price, conf.level = 0.99)

    Welch Two Sample t-test

data:  has$price and hasnot$price
t = 37, df = 452, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
 15907 18304
sample estimates:
mean of x mean of y 
    23543      6438 

PatternSix subset the car prices by whether the cars have warranties. The null hypothesis H0 is μ1 = μ2, and the alternative hypothesis H1 is μ1 \(\neq\) μ2. Because the p-value is extremely low, the team rejects the null hypothesis and concludes that the average price differs between cars with and without a warranty.

Engine Types vs Price

Next, let's take a look at whether different engine types have different average prices. As above, PatternSix drew a box plot to get a visual idea.

df %>% ggplot(aes(engine_type, price,fill=engine_type)) + geom_boxplot()+ ggtitle('Engine_type vs Prices ') + theme(plot.title = element_text(hjust = 0.5))

This time, from the graph, PatternSix could not get a conclusion right away. That is why it is crucial to perform the formal test.

summary(df$engine_type)
  diesel electric gasoline 
   12874        0    25647 
diesel = subset(df, subset = df$engine_type == "diesel")
gas = subset(df, subset = df$engine_type == "gasoline")
t.test(x = diesel$price, y = gas$price, conf.level = 0.99)

    Welch Two Sample t-test

data:  diesel$price and gas$price
t = 16, df = 24452, p-value <2e-16
alternative hypothesis: true difference in means is not equal to 0
99 percent confidence interval:
  981 1344
sample estimates:
mean of x mean of y 
     7411      6249 

PatternSix subset the car prices by engine type. The null hypothesis H0 is μ1 = μ2. The alternative hypothesis H1 is μ1 \(\neq\) μ2.

Surprisingly, the p-value is extremely low, which leads the team to reject the null hypothesis and conclude that average prices differ between engine types.

\(\chi^2\) test

The data set contains not only numerical variables but also categorical variables. The data do not fit the requirements of a goodness-of-fit test, but the categorical variables have to be checked for association with one another to guide variable selection in model building. A test of independence is therefore performed.

contgcTbl1 = table(df$manufacturer_name, df$has_warranty)

(Xsq1 = chisq.test(contgcTbl1))

    Pearson's Chi-squared test

data:  contgcTbl1
X-squared = 10446, df = 54, p-value <2e-16
contgcTbl2 = table(df$manufacturer_name, df$body_type)

(Xsq2 = chisq.test(contgcTbl2))

    Pearson's Chi-squared test

data:  contgcTbl2
X-squared = 35332, df = 594, p-value <2e-16
contgcTbl3 = table(df$manufacturer_name, df$color)

(Xsq3 = chisq.test(contgcTbl3))

    Pearson's Chi-squared test

data:  contgcTbl3
X-squared = 6103, df = 594, p-value <2e-16
contgcTbl4 = table(df$color, df$transmission)

(Xsq4 = chisq.test(contgcTbl4))

    Pearson's Chi-squared test

data:  contgcTbl4
X-squared = 2381, df = 11, p-value <2e-16
contgcTbl5 = table(df$manufacturer_name, df$is_exchangeable)
(Xsq5 = chisq.test(contgcTbl5))

    Pearson's Chi-squared test

data:  contgcTbl5
X-squared = 436, df = 54, p-value <2e-16

The pairs chosen here are manufacturer versus whether cars have warranties (test 1), manufacturer versus body type (test 2), manufacturer versus color (test 3), color versus whether the car is automatic or manual (test 4), and manufacturer versus whether cars are exchangeable (test 5), numbered in the order in which the tests are run above. One thing to note is that for the color and transmission test, which variable is placed in the rows and which in the columns does not matter, since no causality is assumed between them.

PatternSix’s null hypotheses are that the variables in each pair are independent. Interestingly, a wide range of results can be observed. For tests 1, 2 and 3, a warning appears that the chi-square approximation may be incorrect. The reason is that the test of independence requires a large enough sample in every cell. A general rule is that if the expected frequencies are below 5 in more than 20% of the cells, the test should not be used to test independence. That is exactly what happened here, so these test results cannot be used.

For test 4, between color and whether the car is automatic or manual, and for test 5, between manufacturer and whether cars are exchangeable, the results are acceptable. Because of the low p-values in both tests, the null hypotheses are rejected, which means that the variables in the test 4 and test 5 pairs are not independent.
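
The expected-frequency rule of thumb can be checked directly from the object returned by `chisq.test`. The sketch below uses a small made-up contingency table (the counts and labels are assumptions) in which a rare category produces small expected counts. The simulated p-value at the end is shown only as one possible alternative when the approximation is doubtful; it was not used in this report.

```r
# Made-up 4 x 2 contingency table with a rare "True" column.
tbl <- matrix(
  c(175, 88, 23, 5,    # has_warranty = False
      5,  2,  1, 1),   # has_warranty = True
  nrow = 4,
  dimnames = list(manufacturer = c("A", "B", "C", "D"),
                  has_warranty = c("False", "True"))
)

# Expected counts under independence (the warning is suppressed on purpose).
expected <- suppressWarnings(chisq.test(tbl))$expected

# Share of cells with an expected count below 5.
share_small <- mean(expected < 5)
share_small
share_small > 0.20   # TRUE means the chi-square approximation is doubtful

# Possible alternative (assumption): Monte Carlo p-value instead of the approximation.
sim <- chisq.test(tbl, simulate.p.value = TRUE, B = 2000)
sim$p.value
```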

ANOVA

Because several categorical variables have more than two levels, comparing them pair by pair with t-tests would be inefficient, so ANOVA was performed instead.

As above, graphs give an overview of the relationships with price.

Colors by Mean Price

df %>% group_by(color) %>% summarise(price_colorMean=mean(price)) %>% ggplot(aes(x=reorder(color,-price_colorMean),y=price_colorMean)) + geom_col(fill = "#00AFBB") + labs(x='Color',y='Price mean') + ggtitle('Color vs Prices ') + theme(plot.title = element_text(hjust = 0.5))

Body Types by Mean Price

df %>% group_by(body_type) %>% summarise(body_price_mean = mean(price))%>% ggplot(aes(x = reorder(body_type, -body_price_mean),body_price_mean))+geom_col(fill = "#00AFBB") + labs(x='Body Type', y='Mean of price') + ggtitle('Body Type vs Price ') + theme(plot.title = element_text(hjust = 0.5))

Top 10 Manufacturers by Mean Price

df2 = df %>% group_by(manufacturer_name) %>% summarise(manuf_price_mean = mean(price)) %>% arrange(desc(manuf_price_mean)) 
df2 %>% slice(1:10) %>%  ggplot(aes(x = reorder(manufacturer_name, -manuf_price_mean),manuf_price_mean))+geom_col(fill = "#00AFBB") + labs(x='Manufacturer', y='Mean of price')  + ggtitle('Manufacturer vs Price ') + theme(plot.title = element_text(hjust = 0.5))

There are three graphs here: average prices for different colors, for different body types, and for the top ten manufacturers. The last one shows only part of the data because of display limitations.

The average prices appear to differ considerably between groups for colors, body types and the top ten manufacturers. As with the t-test, a formal test should be performed to reach sound conclusions.

One Way ANOVA

df_aov_1 = aov(price ~  color , df)
summary(df_aov_1)
               Df   Sum Sq  Mean Sq F value Pr(>F)    
color          11 1.70e+11 1.55e+10     420 <2e-16 ***
Residuals   38509 1.42e+12 3.69e+07                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
df_aov_2 = aov(price ~  manufacturer_name , df)
summary(df_aov_2)
                     Df   Sum Sq  Mean Sq F value Pr(>F)    
manufacturer_name    54 2.94e+11 5.45e+09     162 <2e-16 ***
Residuals         38466 1.30e+12 3.37e+07                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
df_aov_3 = aov(price ~  body_type , df)
summary(df_aov_3)
               Df   Sum Sq  Mean Sq F value Pr(>F)    
body_type      11 3.50e+11 3.18e+10     988 <2e-16 ***
Residuals   38509 1.24e+12 3.22e+07                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The pairs chosen here are price versus color, manufacturer and body type, respectively. For each test, PatternSix’s null hypothesis is that the mean price is the same across all categories of the variable. Because the categorical variables have multiple categories, the alternative hypothesis is that the category means are not all the same.

In all three cases, given the extremely low p-values, the null hypothesis is rejected, which means that the mean prices are not all the same across the categories within a test.

A Tukey test was also performed on this data set. However, because of the large number of levels in the categorical variables, it is impractical to include the results in the report.
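
For reference, a Tukey post-hoc comparison on a factor with only a few levels looks like the sketch below. The three-level toy data set is an assumption made up for illustration; with 12 colors or 55 manufacturers the same call returns 66 or 1485 pairwise rows, which is why the full output is not reproduced here.

```r
set.seed(99)
# Toy data: three body types with made-up price distributions.
toy <- data.frame(
  body_type = factor(rep(c("sedan", "suv", "hatchback"), each = 50)),
  price     = c(rnorm(50, 6000, 1500),    # sedan
                rnorm(50, 12000, 1500),   # suv
                rnorm(50, 5500, 1500))    # hatchback
)

fit   <- aov(price ~ body_type, data = toy)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair of levels: difference, lower/upper bound, adjusted p-value.
tukey$body_type
```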

Conclusion and Discussions

Overall, PatternSix’s work involved removing null values during data pre-processing, exploratory data analysis, normality checks, finding the correlation between continuous variables, and finding mean price differences across multiple categorical variables. The techniques used included a summary table, normality tests, the t-test, ANOVA, and the Chi-square test. The team used a variety of plots such as bar plots, scatter plots, box plots, Q-Q plots, and histograms to support the different tests.

In more detail, PatternSix deleted the ten rows with null values in the data pre-processing part. Then the team generated a table to show the basic statistical measurements of the numeric data. The price has mean=6640 and standard deviation=6430. Two other measurements considered are skewness and kurtosis, which indicated that most of the variables were highly skewed and heavy-tailed.

Based on these results, PatternSix checked the normality of the continuous data using Q-Q plots, histograms, and the Kolmogorov-Smirnov normality test. The normality tests showed significant evidence to reject the null hypothesis; thus, price was not normally distributed. The other continuous variables showed the same results. Therefore, in future work, if PatternSix uses price as the dependent variable in a regression, the team will transform the data toward a normal distribution.

The team used a correlation plot for checking the correlation between continuous variables. Year of production was highly correlated with price with correlation coefficient(cc)=0.7. Odometer value had a negative correlation with year produced (cc=-0.49) and price (cc=-0.42). Engine capacity also had a positive correlation with price (cc=0.30).

After that the team generated other exploratory data analysis for the feature that the team was more concerned about – price.

The bar plot of the average price by production year showed that vintage cars produced around 1965 are pricier than many newer cars, and that the price increased steadily after around 1985. The box plots and t-tests showed a statistically significant difference in mean price between vehicles with and without a warranty, and between diesel and gasoline engine types. One-way ANOVA was used to check for differences in price across categorical variables with more than two levels. The results suggested that mean prices differ by color, manufacturer name, and body type.

According to the above analysis, the features that influence the prices of cars in the used car market in Belarus are year of production, body type, manufacturer name, engine capacity, odometer value, engine type, color, and transmission.

After conducting the EDA and hypothesis tests on the data, the team has concluded that the initial SMART research questions were successfully answered.

PatternSix’s future work for this topic is building up a model to predict the price based on the analysis that was explored to provide more effective decision-making services for future vehicle buyers and sellers.

Bibliography

Ben Ellencweig, Sam Ezratty, Dan Fleming, and Itai Miller. (2019, June 6). Retrieved from McKinsey & Company website: https://www.mckinsey.com/industries/automotive-and-assembly/our-insights/used-cars-new-platforms-accelerating-sales-in-a-digitally-disrupted-market

Isidore, C. (2021, September 28). Retrieved from CNN Business: https://www.wraltechwire.com/2021/09/28/bad-news-car-buyers-chip-shortage-supply-chain-woes-are-worse-than-we-thought/

Feature selection

#Not working

# library("leaps")

# reg.best10 <- regsubsets(price~. , data = df, nvmax = 10, nbest = 1, method = "backward", really.big=T)  
# 
# # leaps::regsubsets() - Model selection by exhaustive (default) search, forward or backward stepwise, or sequential replacement
# #The plot will show the Adjust R^2 when using the variables across the bottom
# plot(reg.best10, scale = "adjr2", main = "Adjusted R^2")
# plot(reg.best10, scale = "r2", main = "R^2")
# # In the "leaps" package, we can use scale=c("bic","Cp","adjr2","r2")
# plot(reg.best10, scale = "bic", main = "BIC")
# plot(reg.best10, scale = "Cp", main = "Cp")
# summary(reg.best10)
# install.packages('party')
# library(party)
# cf1 <- cforest(price~. , data = df, control=cforest_unbiased(mtry=2,ntree=10))
# install.packages('relaimpo')
# library(relaimpo)
# lmMod <- lm(price~. , data = df, )  # fit lm() model
# relImportance <- calc.relimp(lmMod, type = "lmg", rela = TRUE) 
# library(caret)
# library(mlbench)
# 
# 
# test = df$price
# train = subset(df,select=-c(price,model_name,color,manufacturer_name))
# 
# 
# control <- rfeControl(functions=rfFuncs, method="cv", number=10)
# # run the RFE algorithm
# results <- rfe(train, test, rfeControl=control)
# # summarize the results
# print(results)
# # list the chosen features
# predictors(results)
# # plot the results
# plot(results, type=c("g", "o"))

Data Transformation

We can observe that the target variable is not normally distributed, which departs from the normality assumption behind the linear regression model. Let's try the Box-Cox transformation to normalize the target variable.

library(MASS)
cal.box <- boxcox(price~manufacturer_name+color+transmission+odometer_value+engine_fuel+engine_capacity+body_type+has_warranty+state+drivetrain+is_exchangeable+number_of_photos+state+up_counter+year_produced, data = df)

power <- cal.box$x[cal.box$y==max(cal.box$y)]
power
[1] 0.222

It looks like a transformation with a power close to 0.22 might be useful. Another advantage of transforming the response is that it may lead to a model that predicts only positive prices.

price_normal = (df$price^power-1)/power
data.frame(val=price_normal) %>% ggplot(aes(val)) + geom_density()

df <- cbind(df,price_normal)
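
Since a model fitted on the transformed response predicts on the transformed scale, predictions have to be mapped back to prices. The helpers below are a hypothetical sketch (the function names and the example prices are assumptions) of the Box-Cox transform used above and its inverse, with the power set to the 0.222 reported by `boxcox`.

```r
# Box-Cox transform: (y^lambda - 1) / lambda, or log(y) when lambda is 0.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform: maps transformed values back to the original scale.
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

lambda <- 0.222
price  <- c(1, 2100, 4800, 8950, 50000)   # example prices in USD

z <- box_cox(price, lambda)
z
all.equal(inv_box_cox(z, lambda), price)  # round trip recovers the prices
```

The inverse is only defined for real numbers when `lambda * z + 1` is positive, which is worth checking before back-transforming model predictions.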

Models

Taking all the features except model_name

y = df$price
fit1 = lm(price~manufacturer_name+color+transmission+odometer_value+engine_fuel+engine_capacity+body_type+has_warranty+state+drivetrain+is_exchangeable+number_of_photos+state+up_counter+year_produced, data = df)
summary(fit1)

Call:
lm(formula = price ~ manufacturer_name + color + transmission + 
    odometer_value + engine_fuel + engine_capacity + body_type + 
    has_warranty + state + drivetrain + is_exchangeable + number_of_photos + 
    state + up_counter + year_produced, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-13519  -1626   -259   1100  36424 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                    -8.33e+05   6.14e+03 -135.67  < 2e-16 ***
manufacturer_nameAlfa Romeo    -2.14e+03   4.67e+02   -4.60  4.3e-06 ***
manufacturer_nameAudi           3.96e+02   4.11e+02    0.96  0.33597    
manufacturer_nameBMW            8.22e+02   4.13e+02    1.99  0.04653 *  
manufacturer_nameBuick         -3.27e+03   6.30e+02   -5.19  2.1e-07 ***
manufacturer_nameCadillac      -2.71e+03   6.48e+02   -4.18  2.9e-05 ***
manufacturer_nameChery         -6.82e+03   5.94e+02  -11.49  < 2e-16 ***
manufacturer_nameChevrolet     -3.20e+03   4.35e+02   -7.35  2.0e-13 ***
manufacturer_nameChrysler      -2.65e+03   4.38e+02   -6.04  1.5e-09 ***
manufacturer_nameCitroen       -2.49e+03   4.16e+02   -5.98  2.2e-09 ***
manufacturer_nameDacia         -4.18e+03   5.91e+02   -7.07  1.5e-12 ***
manufacturer_nameDaewoo        -3.63e+03   4.64e+02   -7.83  5.1e-15 ***
manufacturer_nameDodge         -2.98e+03   4.49e+02   -6.63  3.4e-11 ***
manufacturer_nameFiat          -2.46e+03   4.24e+02   -5.79  6.9e-09 ***
manufacturer_nameFord          -1.96e+03   4.12e+02   -4.76  1.9e-06 ***
manufacturer_nameGeely         -5.96e+03   5.64e+02  -10.56  < 2e-16 ***
manufacturer_nameGreat Wall    -7.31e+03   6.83e+02  -10.69  < 2e-16 ***
manufacturer_nameHonda         -1.56e+03   4.23e+02   -3.68  0.00023 ***
manufacturer_nameHyundai       -2.44e+03   4.18e+02   -5.85  5.1e-09 ***
manufacturer_nameInfiniti      -1.03e+03   4.80e+02   -2.14  0.03214 *  
manufacturer_nameIveco          1.23e+03   5.04e+02    2.45  0.01448 *  
manufacturer_nameJaguar         5.21e+03   6.08e+02    8.57  < 2e-16 ***
manufacturer_nameJeep          -3.29e+03   5.15e+02   -6.38  1.8e-10 ***
manufacturer_nameKia           -2.68e+03   4.20e+02   -6.38  1.8e-10 ***
manufacturer_nameLADA          -5.25e+03   4.91e+02  -10.70  < 2e-16 ***
manufacturer_nameLancia        -2.36e+03   5.32e+02   -4.43  9.5e-06 ***
manufacturer_nameLand Rover     2.16e+01   4.73e+02    0.05  0.96361    
manufacturer_nameLexus          3.59e+03   4.65e+02    7.72  1.2e-14 ***
manufacturer_nameLifan         -5.76e+03   6.30e+02   -9.15  < 2e-16 ***
manufacturer_nameLincoln       -1.81e+03   7.20e+02   -2.51  0.01203 *  
manufacturer_nameMazda         -1.78e+03   4.16e+02   -4.29  1.8e-05 ***
manufacturer_nameMercedes-Benz  8.90e+02   4.14e+02    2.15  0.03130 *  
manufacturer_nameMini           7.79e+02   5.71e+02    1.36  0.17268    
manufacturer_nameMitsubishi    -2.34e+03   4.21e+02   -5.56  2.7e-08 ***
manufacturer_nameNissan        -2.27e+03   4.16e+02   -5.46  4.7e-08 ***
manufacturer_nameOpel          -2.02e+03   4.12e+02   -4.91  9.1e-07 ***
manufacturer_namePeugeot       -2.42e+03   4.14e+02   -5.84  5.3e-09 ***
manufacturer_namePontiac       -2.40e+03   6.49e+02   -3.69  0.00022 ***
manufacturer_namePorsche        2.38e+03   5.85e+02    4.07  4.7e-05 ***
manufacturer_nameRenault       -2.53e+03   4.13e+02   -6.13  9.1e-10 ***
manufacturer_nameRover         -2.10e+03   4.60e+02   -4.55  5.3e-06 ***
manufacturer_nameSaab          -1.88e+03   5.14e+02   -3.66  0.00026 ***
manufacturer_nameSeat          -1.54e+03   4.49e+02   -3.42  0.00063 ***
manufacturer_nameSkoda         -2.07e+03   4.26e+02   -4.86  1.2e-06 ***
manufacturer_nameSsangYong     -5.77e+03   5.50e+02  -10.48  < 2e-16 ***
manufacturer_nameSubaru        -3.05e+03   4.52e+02   -6.76  1.4e-11 ***
manufacturer_nameSuzuki        -3.53e+03   4.60e+02   -7.68  1.7e-14 ***
manufacturer_nameToyota         4.08e+01   4.16e+02    0.10  0.92186    
manufacturer_nameVolkswagen    -5.84e+02   4.10e+02   -1.42  0.15472    
manufacturer_nameVolvo         -7.26e+02   4.23e+02   -1.72  0.08623 .  
manufacturer_nameВАЗ           -1.78e+03   4.36e+02   -4.09  4.4e-05 ***
manufacturer_nameГАЗ            1.62e+03   4.73e+02    3.43  0.00060 ***
manufacturer_nameЗАЗ           -3.03e+03   6.52e+02   -4.64  3.5e-06 ***
manufacturer_nameМосквич        3.96e+03   6.10e+02    6.49  8.5e-11 ***
manufacturer_nameУАЗ           -6.97e+03   5.61e+02  -12.44  < 2e-16 ***
colorblue                      -3.15e+02   5.97e+01   -5.28  1.3e-07 ***
colorbrown                      8.72e+02   1.18e+02    7.41  1.3e-13 ***
colorgreen                     -2.73e+02   7.67e+01   -3.55  0.00038 ***
colorgrey                       7.67e+01   6.61e+01    1.16  0.24591    
colororange                     3.48e+02   2.49e+02    1.40  0.16191    
colorother                     -3.00e+02   7.49e+01   -4.00  6.2e-05 ***
colorred                        1.13e+02   7.47e+01    1.52  0.12892    
colorsilver                    -7.77e+02   5.60e+01  -13.89  < 2e-16 ***
colorviolet                    -2.68e+02   1.59e+02   -1.69  0.09085 .  
colorwhite                      3.74e+02   6.65e+01    5.62  2.0e-08 ***
coloryellow                    -6.49e+01   1.96e+02   -0.33  0.74074    
transmissionmechanical         -7.56e+02   4.79e+01  -15.78  < 2e-16 ***
odometer_value                 -5.82e-03   1.57e-04  -37.01  < 2e-16 ***
engine_fuelgas                 -1.19e+03   9.69e+01  -12.26  < 2e-16 ***
engine_fuelgasoline            -8.68e+02   4.27e+01  -20.32  < 2e-16 ***
engine_fuelhybrid-diesel        3.09e+03   2.33e+03    1.33  0.18452    
engine_fuelhybrid-petrol       -2.31e+02   2.29e+02   -1.01  0.31394    
engine_capacity                 8.27e+02   3.71e+01   22.32  < 2e-16 ***
body_typecoupe                 -2.09e+03   4.01e+02   -5.21  1.9e-07 ***
body_typehatchback             -3.93e+03   3.83e+02  -10.25  < 2e-16 ***
body_typeliftback              -2.79e+03   4.09e+02   -6.81  1.0e-11 ***
body_typelimousine             -2.16e+03   1.10e+03   -1.96  0.04988 *  
body_typeminibus               -1.03e+03   3.93e+02   -2.61  0.00906 ** 
body_typeminivan               -2.97e+03   3.86e+02   -7.69  1.5e-14 ***
body_typepickup                 1.13e+02   4.82e+02    0.23  0.81446    
body_typesedan                 -3.73e+03   3.82e+02   -9.77  < 2e-16 ***
body_typesuv                   -1.30e+03   3.87e+02   -3.35  0.00082 ***
body_typeuniversal             -3.86e+03   3.84e+02  -10.04  < 2e-16 ***
body_typevan                   -2.68e+03   4.02e+02   -6.66  2.7e-11 ***
has_warrantyTrue                1.42e+03   2.98e+02    4.75  2.0e-06 ***
statenew                        9.95e+03   3.49e+02   28.55  < 2e-16 ***
stateowned                      9.66e+02   1.73e+02    5.59  2.2e-08 ***
drivetrainfront                -2.16e+03   8.11e+01  -26.66  < 2e-16 ***
drivetrainrear                 -2.06e+03   9.54e+01  -21.58  < 2e-16 ***
is_exchangeableTrue            -1.97e+02   3.61e+01   -5.45  5.0e-08 ***
number_of_photos                7.40e+01   2.92e+00   25.37  < 2e-16 ***
up_counter                      2.48e+00   3.93e-01    6.32  2.7e-10 ***
year_produced                   4.22e+02   3.03e+00  139.00  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3280 on 38428 degrees of freedom
Multiple R-squared:  0.74,  Adjusted R-squared:  0.739 
F-statistic: 1.19e+03 on 92 and 38428 DF,  p-value: <2e-16
plot(fit1,which=1)

library(car)
vif(fit1)
                   GVIF Df GVIF^(1/(2*Df))
manufacturer_name 20.43 54            1.03
color              1.53 11            1.02
transmission       1.83  1            1.35
odometer_value     1.64  1            1.28
engine_fuel        1.64  4            1.06
engine_capacity    2.21  1            1.49
body_type          8.71 11            1.10
has_warranty       3.65  1            1.91
state              3.73  2            1.39
drivetrain         6.17  2            1.58
is_exchangeable    1.06  1            1.03
number_of_photos   1.13  1            1.06
up_counter         1.03  1            1.02
year_produced      2.14  1            1.46
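
For factors with more than one degree of freedom, the last column of this table is the one that is comparable across terms. As a rough sketch of how it is derived, the code below recomputes it from the GVIF and Df columns printed above and squares it so that it can be read on the scale of an ordinary VIF. The cutoff of 5 used in the final check is a common rule of thumb and an assumption here, not something the original analysis states.

```r
# GVIF and Df values copied from the vif(fit1) output above.
gvif <- c(manufacturer_name = 20.43, color = 1.53, transmission = 1.83,
          odometer_value = 1.64, engine_fuel = 1.64, engine_capacity = 2.21,
          body_type = 8.71, has_warranty = 3.65, state = 3.73,
          drivetrain = 6.17, is_exchangeable = 1.06, number_of_photos = 1.13,
          up_counter = 1.03, year_produced = 2.14)
dfs <- c(54, 11, 1, 1, 4, 1, 11, 1, 2, 2, 1, 1, 1, 1)

adj <- gvif^(1 / (2 * dfs))  # third column of the table
vif_like <- adj^2            # comparable to an ordinary VIF
print(round(cbind(GVIF = gvif, Df = dfs, adjusted = adj, squared = vif_like), 2))
print(names(vif_like)[vif_like > 5])  # terms above the assumed cutoff
```

Read this way, the large raw GVIF of manufacturer_name mostly reflects its 54 degrees of freedom rather than strong collinearity.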

Polynomial Terms

The next step is to see whether changing the form of some of the covariates can improve the fit. The linearity assumption can be checked with plots of residuals versus fitted values, plots of residuals versus explanatory variables, partial regression plots, or GAM plots. Here PatternSix applies GAM plots to the continuous variables to find transformations under which they follow a linear model. The GAM plots (one per explanatory variable) give an idea of which variables to transform: if the plot for a variable is straight, that variable can be left untransformed; if the plot is non-linear, its shape suggests the form of the transformation.

library(mgcv)
cal.gam <- gam(price_normal~s(odometer_value)+s(year_produced)+s(engine_capacity)+s(number_of_photos)+s(up_counter), data=df)

par(mfrow=c(2,3))
plot(cal.gam)
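
As a numeric complement to reading the plots by eye, the effective degrees of freedom (edf) of each smooth term can be inspected: a value close to 1 corresponds to a roughly straight line, and larger values indicate curvature. The sketch below illustrates this on simulated data, not on the car data; the variable names x_lin and x_curv and the choice of method = "REML" are assumptions made for the illustration.

```r
library(mgcv)

set.seed(1)
n <- 500
x_lin  <- runif(n)   # hypothetical predictor with a linear effect
x_curv <- runif(n)   # hypothetical predictor with a curved effect
y <- 2 * x_lin + 8 * (x_curv - 0.5)^2 + rnorm(n, sd = 0.2)
sim <- data.frame(y, x_lin, x_curv)

# One smooth per predictor, as in cal.gam above.
g <- gam(y ~ s(x_lin) + s(x_curv), data = sim, method = "REML")

# Effective degrees of freedom, one value per smooth term.
edf <- as.numeric(summary(g)$edf)
print(round(edf, 2))
```

The same check on the real model would be summary(cal.gam)$edf, read alongside the plots rather than instead of them.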

These plots indicate that odometer_value and up_counter should remain untransformed (because their plots are relatively straight), but year_produced, engine_capacity and number_of_photos need transforming to give a better model. One possibility is to add a second-degree polynomial in year_produced, a sine transformation of engine_capacity and a second-degree polynomial in number_of_photos to the regression equation.

fit2 <- lm(price_normal ~ odometer_value+poly(year_produced, 2)+sin(engine_capacity)+poly(number_of_photos, 2)+up_counter
           +manufacturer_name+color+transmission+engine_fuel+body_type+has_warranty+state+drivetrain+is_exchangeable+state, data = df)
summary(fit2)

Call:
lm(formula = price_normal ~ odometer_value + poly(year_produced, 
    2) + sin(engine_capacity) + poly(number_of_photos, 2) + up_counter + 
    manufacturer_name + color + transmission + engine_fuel + 
    body_type + has_warranty + state + drivetrain + is_exchangeable + 
    state, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.829  -1.277   0.094   1.353  24.017 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     2.87e+01   3.96e-01   72.64  < 2e-16 ***
odometer_value                 -1.11e-06   1.09e-07  -10.15  < 2e-16 ***
poly(year_produced, 2)1         9.26e+02   3.27e+00  283.27  < 2e-16 ***
poly(year_produced, 2)2         2.52e+02   2.79e+00   90.29  < 2e-16 ***
sin(engine_capacity)           -1.78e+00   3.97e-02  -44.75  < 2e-16 ***
poly(number_of_photos, 2)1      5.61e+01   2.35e+00   23.91  < 2e-16 ***
poly(number_of_photos, 2)2     -8.40e+00   2.22e+00   -3.78  0.00016 ***
up_counter                      2.14e-03   2.64e-04    8.11  5.0e-16 ***
manufacturer_nameAlfa Romeo    -2.20e+00   3.14e-01   -7.02  2.3e-12 ***
manufacturer_nameAudi           6.10e-01   2.76e-01    2.21  0.02726 *  
manufacturer_nameBMW            4.56e-01   2.77e-01    1.64  0.10001    
manufacturer_nameBuick         -3.89e+00   4.23e-01   -9.20  < 2e-16 ***
manufacturer_nameCadillac      -1.31e+00   4.34e-01   -3.03  0.00246 ** 
manufacturer_nameChery         -7.12e+00   3.99e-01  -17.84  < 2e-16 ***
manufacturer_nameChevrolet     -3.29e+00   2.92e-01  -11.25  < 2e-16 ***
manufacturer_nameChrysler      -2.31e+00   2.94e-01   -7.86  3.9e-15 ***
manufacturer_nameCitroen       -2.42e+00   2.80e-01   -8.66  < 2e-16 ***
manufacturer_nameDacia         -4.54e+00   3.97e-01  -11.43  < 2e-16 ***
manufacturer_nameDaewoo        -5.28e+00   3.11e-01  -16.94  < 2e-16 ***
manufacturer_nameDodge         -2.82e+00   3.02e-01   -9.35  < 2e-16 ***
manufacturer_nameFiat          -3.25e+00   2.85e-01  -11.43  < 2e-16 ***
manufacturer_nameFord          -2.88e+00   2.77e-01  -10.39  < 2e-16 ***
manufacturer_nameGeely         -7.10e+00   3.79e-01  -18.73  < 2e-16 ***
manufacturer_nameGreat Wall    -5.51e+00   4.59e-01  -12.00  < 2e-16 ***
manufacturer_nameHonda         -6.41e-01   2.84e-01   -2.26  0.02405 *  
manufacturer_nameHyundai       -2.48e+00   2.81e-01   -8.83  < 2e-16 ***
manufacturer_nameInfiniti      -9.42e-01   3.23e-01   -2.92  0.00350 ** 
manufacturer_nameIveco          1.22e-01   3.39e-01    0.36  0.71783    
manufacturer_nameJaguar         1.34e+00   4.08e-01    3.28  0.00105 ** 
manufacturer_nameJeep          -2.08e+00   3.46e-01   -6.02  1.8e-09 ***
manufacturer_nameKia           -2.75e+00   2.83e-01   -9.73  < 2e-16 ***
manufacturer_nameLADA          -6.42e+00   3.30e-01  -19.43  < 2e-16 ***
manufacturer_nameLancia        -2.39e+00   3.57e-01   -6.67  2.5e-11 ***
manufacturer_nameLand Rover    -7.36e-01   3.18e-01   -2.32  0.02046 *  
manufacturer_nameLexus          1.39e+00   3.12e-01    4.46  8.3e-06 ***
manufacturer_nameLifan         -6.74e+00   4.24e-01  -15.91  < 2e-16 ***
manufacturer_nameLincoln       -1.40e+00   4.83e-01   -2.90  0.00378 ** 
manufacturer_nameMazda         -1.99e+00   2.80e-01   -7.10  1.3e-12 ***
manufacturer_nameMercedes-Benz  6.75e-02   2.78e-01    0.24  0.80813    
manufacturer_nameMini           3.60e-01   3.84e-01    0.94  0.34840    
manufacturer_nameMitsubishi    -2.18e+00   2.83e-01   -7.73  1.1e-14 ***
manufacturer_nameNissan        -2.37e+00   2.79e-01   -8.47  < 2e-16 ***
manufacturer_nameOpel          -2.07e+00   2.77e-01   -7.47  8.0e-14 ***
manufacturer_namePeugeot       -1.93e+00   2.78e-01   -6.94  3.9e-12 ***
manufacturer_namePontiac       -1.29e+00   4.36e-01   -2.96  0.00303 ** 
manufacturer_namePorsche       -6.45e-02   3.93e-01   -0.16  0.86955    
manufacturer_nameRenault       -2.93e+00   2.77e-01  -10.57  < 2e-16 ***
manufacturer_nameRover         -2.74e+00   3.09e-01   -8.87  < 2e-16 ***
manufacturer_nameSaab          -9.74e-01   3.46e-01   -2.82  0.00485 ** 
manufacturer_nameSeat          -1.91e+00   3.02e-01   -6.34  2.3e-10 ***
manufacturer_nameSkoda         -1.98e+00   2.86e-01   -6.93  4.4e-12 ***
manufacturer_nameSsangYong     -3.61e+00   3.70e-01   -9.76  < 2e-16 ***
manufacturer_nameSubaru        -1.28e+00   3.04e-01   -4.22  2.4e-05 ***
manufacturer_nameSuzuki        -2.71e+00   3.09e-01   -8.78  < 2e-16 ***
manufacturer_nameToyota         3.70e-02   2.80e-01    0.13  0.89486    
manufacturer_nameVolkswagen    -6.98e-01   2.76e-01   -2.53  0.01128 *  
manufacturer_nameVolvo         -3.83e-01   2.85e-01   -1.35  0.17783    
manufacturer_nameВАЗ           -4.79e+00   2.93e-01  -16.35  < 2e-16 ***
manufacturer_nameГАЗ           -3.29e+00   3.22e-01  -10.21  < 2e-16 ***
manufacturer_nameЗАЗ           -6.95e+00   4.38e-01  -15.89  < 2e-16 ***
manufacturer_nameМосквич       -3.57e+00   4.13e-01   -8.63  < 2e-16 ***
manufacturer_nameУАЗ           -6.03e+00   3.77e-01  -16.01  < 2e-16 ***
colorblue                      -2.40e-01   4.01e-02   -5.98  2.3e-09 ***
colorbrown                     -1.67e-01   7.92e-02   -2.11  0.03474 *  
colorgreen                     -3.27e-01   5.16e-02   -6.34  2.4e-10 ***
colorgrey                      -1.42e-01   4.44e-02   -3.20  0.00136 ** 
colororange                    -1.92e-01   1.67e-01   -1.15  0.25068    
colorother                     -1.76e-01   5.03e-02   -3.50  0.00047 ***
colorred                       -5.55e-01   5.01e-02  -11.08  < 2e-16 ***
colorsilver                    -1.37e-01   3.77e-02   -3.64  0.00027 ***
colorviolet                    -3.90e-01   1.07e-01   -3.66  0.00026 ***
colorwhite                     -6.24e-01   4.47e-02  -13.95  < 2e-16 ***
coloryellow                    -2.77e-01   1.32e-01   -2.10  0.03539 *  
transmissionmechanical         -6.98e-01   3.20e-02  -21.81  < 2e-16 ***
engine_fuelgas                 -1.21e+00   6.51e-02  -18.58  < 2e-16 ***
engine_fuelgasoline            -1.14e+00   2.86e-02  -39.82  < 2e-16 ***
engine_fuelhybrid-diesel        3.27e+00   1.56e+00    2.09  0.03626 *  
engine_fuelhybrid-petrol       -1.13e+00   1.54e-01   -7.34  2.1e-13 ***
body_typecoupe                 -1.81e+00   2.70e-01   -6.73  1.7e-11 ***
body_typehatchback             -3.78e+00   2.57e-01  -14.70  < 2e-16 ***
body_typeliftback              -2.96e+00   2.75e-01  -10.77  < 2e-16 ***
body_typelimousine             -5.31e-01   7.39e-01   -0.72  0.47247    
body_typeminibus               -2.56e-01   2.64e-01   -0.97  0.33203    
body_typeminivan               -1.76e+00   2.59e-01   -6.78  1.2e-11 ***
body_typepickup                -4.50e-01   3.23e-01   -1.39  0.16386    
body_typesedan                 -3.48e+00   2.56e-01  -13.59  < 2e-16 ***
body_typesuv                   -1.51e+00   2.60e-01   -5.82  6.0e-09 ***
body_typeuniversal             -3.39e+00   2.58e-01  -13.15  < 2e-16 ***
body_typevan                   -1.62e+00   2.70e-01   -5.99  2.1e-09 ***
has_warrantyTrue               -7.61e-01   2.00e-01   -3.80  0.00014 ***
statenew                        5.81e+00   2.35e-01   24.76  < 2e-16 ***
stateowned                      4.76e+00   1.16e-01   41.00  < 2e-16 ***
drivetrainfront                -1.04e+00   5.45e-02  -19.08  < 2e-16 ***
drivetrainrear                 -5.55e-01   6.43e-02   -8.62  < 2e-16 ***
is_exchangeableTrue            -1.77e-01   2.42e-02   -7.31  2.7e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.2 on 38426 degrees of freedom
Multiple R-squared:  0.883, Adjusted R-squared:  0.883 
F-statistic: 3.1e+03 on 94 and 38426 DF,  p-value: <2e-16
plot(fit2,which=1)
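
The coefficients on the poly(year_produced, 2) terms above (9.26e+02 and 2.52e+02) look large because poly() builds an orthogonal polynomial basis by default, with columns scaled to unit length, instead of using the variable and its square directly. The sketch below, on simulated data with hypothetical variable names, suggests that the orthogonal and raw parameterisations describe the same fitted curve and differ only in how the coefficients are expressed.

```r
set.seed(42)
year <- sample(1980:2019, 300, replace = TRUE)  # hypothetical production years
y <- 0.01 * (year - 1980)^2 + rnorm(300)        # hypothetical response

fit_orth <- lm(y ~ poly(year, 2))               # orthogonal basis, as in fit2
fit_raw  <- lm(y ~ poly(year, 2, raw = TRUE))   # plain year and year^2

print(coef(fit_orth))
print(coef(fit_raw))

# Largest difference between the two sets of fitted values.
max_diff <- max(abs(fitted(fit_orth) - fitted(fit_raw)))
print(max_diff)
```

The orthogonal form is usually the safer default because its columns are uncorrelated, while the raw form is easier to interpret term by term.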

Interaction Terms

Next, check for an interaction between odometer_value and the second-degree polynomial in year_produced. Note that color is still included in the formula below.

fit3 <- lm(price_normal ~ odometer_value*poly(year_produced, 2)+sin(engine_capacity)+poly(number_of_photos, 2)+up_counter
           +manufacturer_name+color+transmission+engine_fuel+body_type+has_warranty+state+drivetrain+is_exchangeable+state, data = df)
summary(fit3)

Call:
lm(formula = price_normal ~ odometer_value * poly(year_produced, 
    2) + sin(engine_capacity) + poly(number_of_photos, 2) + up_counter + 
    manufacturer_name + color + transmission + engine_fuel + 
    body_type + has_warranty + state + drivetrain + is_exchangeable + 
    state, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-23.80  -1.27   0.08   1.34  23.27 

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)                             2.84e+01   3.93e-01   72.31  < 2e-16 ***
odometer_value                          1.29e-07   1.31e-07    0.99  0.32330    
poly(year_produced, 2)1                 8.66e+02   4.74e+00  182.67  < 2e-16 ***
poly(year_produced, 2)2                 2.88e+02   3.68e+00   78.25  < 2e-16 ***
sin(engine_capacity)                   -1.72e+00   3.95e-02  -43.44  < 2e-16 ***
poly(number_of_photos, 2)1              5.58e+01   2.33e+00   23.95  < 2e-16 ***
poly(number_of_photos, 2)2             -7.18e+00   2.21e+00   -3.25  0.00114 ** 
up_counter                              2.08e-03   2.62e-04    7.94  2.0e-15 ***
manufacturer_nameAlfa Romeo            -2.30e+00   3.11e-01   -7.38  1.6e-13 ***
manufacturer_nameAudi                   6.40e-01   2.74e-01    2.33  0.01962 *  
manufacturer_nameBMW                    4.18e-01   2.75e-01    1.52  0.12923    
manufacturer_nameBuick                 -3.76e+00   4.20e-01   -8.95  < 2e-16 ***
manufacturer_nameCadillac              -1.34e+00   4.31e-01   -3.11  0.00189 ** 
manufacturer_nameChery                 -7.09e+00   3.96e-01  -17.91  < 2e-16 ***
manufacturer_nameChevrolet             -3.25e+00   2.90e-01  -11.21  < 2e-16 ***
manufacturer_nameChrysler              -2.38e+00   2.92e-01   -8.16  3.4e-16 ***
manufacturer_nameCitroen               -2.50e+00   2.77e-01   -9.01  < 2e-16 ***
manufacturer_nameDacia                 -4.62e+00   3.94e-01  -11.74  < 2e-16 ***
manufacturer_nameDaewoo                -5.33e+00   3.09e-01  -17.25  < 2e-16 ***
manufacturer_nameDodge                 -2.90e+00   2.99e-01   -9.70  < 2e-16 ***
manufacturer_nameFiat                  -3.32e+00   2.82e-01  -11.75  < 2e-16 ***
manufacturer_nameFord                  -2.91e+00   2.75e-01  -10.57  < 2e-16 ***
manufacturer_nameGeely                 -7.02e+00   3.77e-01  -18.65  < 2e-16 ***
manufacturer_nameGreat Wall            -5.53e+00   4.56e-01  -12.13  < 2e-16 ***
manufacturer_nameHonda                 -7.18e-01   2.82e-01   -2.55  0.01091 *  
manufacturer_nameHyundai               -2.49e+00   2.79e-01   -8.95  < 2e-16 ***
manufacturer_nameInfiniti              -9.21e-01   3.20e-01   -2.88  0.00401 ** 
manufacturer_nameIveco                 -9.44e-02   3.36e-01   -0.28  0.77889    
manufacturer_nameJaguar                 1.40e+00   4.05e-01    3.46  0.00054 ***
manufacturer_nameJeep                  -2.11e+00   3.43e-01   -6.13  8.7e-10 ***
manufacturer_nameKia                   -2.76e+00   2.80e-01   -9.85  < 2e-16 ***
manufacturer_nameLADA                  -6.33e+00   3.28e-01  -19.26  < 2e-16 ***
manufacturer_nameLancia                -2.45e+00   3.55e-01   -6.92  4.7e-12 ***
manufacturer_nameLand Rover            -7.37e-01   3.15e-01   -2.34  0.01946 *  
manufacturer_nameLexus                  1.38e+00   3.10e-01    4.46  8.2e-06 ***
manufacturer_nameLifan                 -6.60e+00   4.21e-01  -15.70  < 2e-16 ***
manufacturer_nameLincoln               -1.41e+00   4.79e-01   -2.94  0.00333 ** 
manufacturer_nameMazda                 -2.03e+00   2.78e-01   -7.32  2.6e-13 ***
manufacturer_nameMercedes-Benz          6.73e-02   2.76e-01    0.24  0.80704    
manufacturer_nameMini                   3.67e-01   3.81e-01    0.96  0.33482    
manufacturer_nameMitsubishi            -2.24e+00   2.81e-01   -7.98  1.5e-15 ***
manufacturer_nameNissan                -2.40e+00   2.77e-01   -8.65  < 2e-16 ***
manufacturer_nameOpel                  -2.12e+00   2.75e-01   -7.72  1.2e-14 ***
manufacturer_namePeugeot               -2.00e+00   2.76e-01   -7.25  4.1e-13 ***
manufacturer_namePontiac               -1.38e+00   4.33e-01   -3.19  0.00143 ** 
manufacturer_namePorsche               -5.01e-02   3.90e-01   -0.13  0.89782    
manufacturer_nameRenault               -2.98e+00   2.75e-01  -10.81  < 2e-16 ***
manufacturer_nameRover                 -2.83e+00   3.07e-01   -9.21  < 2e-16 ***
manufacturer_nameSaab                  -1.06e+00   3.43e-01   -3.08  0.00205 ** 
manufacturer_nameSeat                  -1.98e+00   3.00e-01   -6.60  4.2e-11 ***
manufacturer_nameSkoda                 -2.04e+00   2.84e-01   -7.19  6.5e-13 ***
manufacturer_nameSsangYong             -3.65e+00   3.67e-01   -9.94  < 2e-16 ***
manufacturer_nameSubaru                -1.33e+00   3.02e-01   -4.41  1.0e-05 ***
manufacturer_nameSuzuki                -2.76e+00   3.07e-01   -9.00  < 2e-16 ***
manufacturer_nameToyota                -1.38e-02   2.78e-01   -0.05  0.96026    
manufacturer_nameVolkswagen            -7.11e-01   2.74e-01   -2.60  0.00934 ** 
manufacturer_nameVolvo                 -4.66e-01   2.82e-01   -1.65  0.09887 .  
manufacturer_nameВАЗ                   -4.91e+00   2.91e-01  -16.90  < 2e-16 ***
manufacturer_nameГАЗ                   -4.20e+00   3.22e-01  -13.03  < 2e-16 ***
manufacturer_nameЗАЗ                   -7.35e+00   4.35e-01  -16.90  < 2e-16 ***
manufacturer_nameМосквич               -4.83e+00   4.14e-01  -11.69  < 2e-16 ***
manufacturer_nameУАЗ                   -6.10e+00   3.74e-01  -16.32  < 2e-16 ***
colorblue                              -2.33e-01   3.99e-02   -5.85  4.9e-09 ***
colorbrown                             -1.04e-01   7.87e-02   -1.32  0.18647    
colorgreen                             -3.27e-01   5.14e-02   -6.37  1.9e-10 ***
colorgrey                              -1.19e-01   4.41e-02   -2.71  0.00681 ** 
colororange                            -1.10e-01   1.66e-01   -0.66  0.50607    
colorother                             -1.61e-01   4.99e-02   -3.22  0.00127 ** 
colorred                               -5.03e-01   4.98e-02  -10.10  < 2e-16 ***
colorsilver                            -1.59e-01   3.74e-02   -4.26  2.1e-05 ***
colorviolet                            -3.77e-01   1.06e-01   -3.57  0.00036 ***
colorwhite                             -5.73e-01   4.45e-02  -12.89  < 2e-16 ***
coloryellow                            -2.07e-01   1.31e-01   -1.58  0.11309    
transmissionmechanical                 -6.82e-01   3.18e-02  -21.49  < 2e-16 ***
engine_fuelgas                         -1.09e+00   6.49e-02  -16.87  < 2e-16 ***
engine_fuelgasoline                    -1.03e+00   2.89e-02  -35.83  < 2e-16 ***
engine_fuelhybrid-diesel                3.56e+00   1.55e+00    2.30  0.02156 *  
engine_fuelhybrid-petrol               -1.00e+00   1.53e-01   -6.57  5.2e-11 ***
body_typecoupe                         -1.81e+00   2.68e-01   -6.75  1.5e-11 ***
body_typehatchback                     -3.77e+00   2.55e-01  -14.76  < 2e-16 ***
body_typeliftback                      -2.94e+00   2.73e-01  -10.78  < 2e-16 ***
body_typelimousine                     -5.02e-01   7.34e-01   -0.68  0.49374    
body_typeminibus                       -3.08e-01   2.62e-01   -1.17  0.24015    
body_typeminivan                       -1.80e+00   2.57e-01   -7.00  2.7e-12 ***
body_typepickup                        -4.34e-01   3.21e-01   -1.35  0.17628    
body_typesedan                         -3.46e+00   2.54e-01  -13.60  < 2e-16 ***
body_typesuv                           -1.46e+00   2.58e-01   -5.64  1.7e-08 ***
body_typeuniversal                     -3.42e+00   2.56e-01  -13.38  < 2e-16 ***
body_typevan                           -1.59e+00   2.68e-01   -5.92  3.3e-09 ***
has_warrantyTrue                       -6.29e-01   1.99e-01   -3.15  0.00163 ** 
statenew                                6.03e+00   2.36e-01   25.56  < 2e-16 ***
stateowned                              4.73e+00   1.15e-01   41.11  < 2e-16 ***
drivetrainfront                        -1.02e+00   5.41e-02  -18.79  < 2e-16 ***
drivetrainrear                         -5.29e-01   6.38e-02   -8.28  < 2e-16 ***
is_exchangeableTrue                    -1.75e-01   2.41e-02   -7.29  3.2e-13 ***
odometer_value:poly(year_produced, 2)1  3.35e-04   2.10e-05   15.92  < 2e-16 ***
odometer_value:poly(year_produced, 2)2 -1.18e-04   1.77e-05   -6.67  2.6e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.19 on 38424 degrees of freedom
Multiple R-squared:  0.885, Adjusted R-squared:  0.885 
F-statistic: 3.09e+03 on 96 and 38424 DF,  p-value: <2e-16
plot(fit3,which=1)

Year_produced vs price

plot(df$year_produced, df$price, main="Year_produced vs price")

Year_produced vs price

plot(df$year_produced, df$price, main="Year_produced vs price", pch=19)

# Adding year_produced as a polynomial of 2nd degree

fit2 <- lm(price ~ manufacturer_name+color+transmission+odometer_value+engine_fuel+engine_capacity+body_type+has_warranty+state+drivetrain+is_exchangeable+number_of_photos+state+up_counter+poly(year_produced,2), data = df)
summary(fit2)

Call:
lm(formula = price ~ manufacturer_name + color + transmission + 
    odometer_value + engine_fuel + engine_capacity + body_type + 
    has_warranty + state + drivetrain + is_exchangeable + number_of_photos + 
    state + up_counter + poly(year_produced, 2), data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-23983  -1333    -33   1100  30808 

Coefficients:
                                Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     9.92e+03   5.31e+02   18.68  < 2e-16 ***
manufacturer_nameAlfa Romeo    -1.44e+03   4.10e+02   -3.52  0.00043 ***
manufacturer_nameAudi          -1.13e+02   3.61e+02   -0.31  0.75354    
manufacturer_nameBMW            8.98e+02   3.62e+02    2.48  0.01321 *  
manufacturer_nameBuick         -4.41e+03   5.53e+02   -7.98  1.6e-15 ***
manufacturer_nameCadillac      -3.16e+03   5.68e+02   -5.56  2.7e-08 ***
manufacturer_nameChery         -7.18e+03   5.21e+02  -13.77  < 2e-16 ***
manufacturer_nameChevrolet     -3.94e+03   3.82e+02  -10.30  < 2e-16 ***
manufacturer_nameChrysler      -2.53e+03   3.84e+02   -6.57  5.0e-11 ***
manufacturer_nameCitroen       -2.38e+03   3.65e+02   -6.52  6.9e-11 ***
manufacturer_nameDacia         -4.72e+03   5.19e+02   -9.10  < 2e-16 ***
manufacturer_nameDaewoo        -2.96e+03   4.07e+02   -7.27  3.6e-13 ***
manufacturer_nameDodge         -2.83e+03   3.94e+02   -7.17  7.7e-13 ***
manufacturer_nameFiat          -2.05e+03   3.72e+02   -5.52  3.3e-08 ***
manufacturer_nameFord          -2.21e+03   3.62e+02   -6.12  9.7e-10 ***
manufacturer_nameGeely         -7.64e+03   4.95e+02  -15.42  < 2e-16 ***
manufacturer_nameGreat Wall    -7.15e+03   6.00e+02  -11.91  < 2e-16 ***
manufacturer_nameHonda         -1.20e+03   3.71e+02   -3.23  0.00123 ** 
manufacturer_nameHyundai       -2.52e+03   3.67e+02   -6.87  6.5e-12 ***
manufacturer_nameInfiniti      -1.23e+03   4.22e+02   -2.93  0.00342 ** 
manufacturer_nameIveco          7.23e+02   4.42e+02    1.63  0.10224    
manufacturer_nameJaguar         4.69e+03   5.34e+02    8.79  < 2e-16 ***
manufacturer_nameJeep          -3.09e+03   4.52e+02   -6.84  8.0e-12 ***
manufacturer_nameKia           -2.91e+03   3.69e+02   -7.90  2.9e-15 ***
manufacturer_nameLADA          -7.95e+03   4.32e+02  -18.42  < 2e-16 ***
manufacturer_nameLancia        -1.77e+03   4.67e+02   -3.78  0.00015 ***
manufacturer_nameLand Rover     7.80e+01   4.15e+02    0.19  0.85095    
manufacturer_nameLexus          3.25e+03   4.08e+02    7.97  1.6e-15 ***
manufacturer_nameLifan         -7.79e+03   5.53e+02  -14.08  < 2e-16 ***
manufacturer_nameLincoln       -2.14e+03   6.32e+02   -3.38  0.00071 ***
manufacturer_nameMazda         -1.66e+03   3.65e+02   -4.55  5.4e-06 ***
manufacturer_nameMercedes-Benz  5.33e+02   3.63e+02    1.47  0.14163    
manufacturer_nameMini           3.03e+02   5.01e+02    0.60  0.54609    
manufacturer_nameMitsubishi    -2.05e+03   3.69e+02   -5.55  2.8e-08 ***
manufacturer_nameNissan        -2.25e+03   3.65e+02   -6.15  7.7e-10 ***
manufacturer_nameOpel          -1.96e+03   3.62e+02   -5.42  6.1e-08 ***
manufacturer_namePeugeot       -2.19e+03   3.63e+02   -6.02  1.8e-09 ***
manufacturer_namePontiac       -2.18e+03   5.70e+02   -3.82  0.00013 ***
manufacturer_namePorsche        1.98e+03   5.13e+02    3.86  0.00011 ***
manufacturer_nameRenault       -2.71e+03   3.62e+02   -7.47  8.3e-14 ***
manufacturer_nameRover         -1.40e+03   4.04e+02   -3.46  0.00054 ***
manufacturer_nameSaab          -1.55e+03   4.51e+02   -3.42  0.00062 ***
manufacturer_nameSeat          -1.32e+03   3.94e+02   -3.34  0.00085 ***
manufacturer_nameSkoda         -2.71e+03   3.74e+02   -7.25  4.4e-13 ***
manufacturer_nameSsangYong     -5.30e+03   4.83e+02  -10.98  < 2e-16 ***
manufacturer_nameSubaru        -3.00e+03   3.97e+02   -7.56  4.2e-14 ***
manufacturer_nameSuzuki        -2.96e+03   4.04e+02   -7.33  2.3e-13 ***
manufacturer_nameToyota         1.51e+01   3.65e+02    0.04  0.96712    
manufacturer_nameVolkswagen    -1.03e+03   3.60e+02   -2.85  0.00435 ** 
manufacturer_nameVolvo         -6.85e+02   3.72e+02   -1.84  0.06532 .  
manufacturer_nameВАЗ           -2.73e+03   3.83e+02   -7.14  9.8e-13 ***
manufacturer_nameГАЗ           -6.28e+03   4.21e+02  -14.92  < 2e-16 ***
manufacturer_nameЗАЗ           -4.43e+03   5.73e+02   -7.74  9.9e-15 ***
manufacturer_nameМосквич       -4.14e+03   5.40e+02   -7.67  1.8e-14 ***
manufacturer_nameУАЗ           -7.56e+03   4.92e+02  -15.37  < 2e-16 ***
colorblue                       1.13e+01   5.24e+01    0.22  0.82957    
colorbrown                      1.75e+02   1.03e+02    1.69  0.09013 .  
colorgreen                      1.83e+02   6.74e+01    2.72  0.00663 ** 
colorgrey                      -2.69e+01   5.80e+01   -0.46  0.64271    
colororange                    -8.50e+01   2.18e+02   -0.39  0.69696    
colorother                     -1.20e+02   6.58e+01   -1.82  0.06813 .  
colorred                        1.46e+02   6.55e+01    2.23  0.02565 *  
colorsilver                    -3.82e+02   4.92e+01   -7.76  8.9e-15 ***
colorviolet                     2.68e+02   1.39e+02    1.92  0.05476 .  
colorwhite                      5.62e+00   5.85e+01    0.10  0.92349    
coloryellow                    -1.45e+02   1.72e+02   -0.84  0.39993    
transmissionmechanical         -5.18e+02   4.21e+01  -12.30  < 2e-16 ***
odometer_value                 -1.97e-03   1.43e-04  -13.81  < 2e-16 ***
engine_fuelgas                 -1.04e+03   8.50e+01  -12.29  < 2e-16 ***
engine_fuelgasoline            -8.52e+02   3.75e+01  -22.72  < 2e-16 ***
engine_fuelhybrid-diesel        2.07e+03   2.04e+03    1.01  0.31115    
engine_fuelhybrid-petrol       -6.36e+02   2.01e+02   -3.16  0.00158 ** 
engine_capacity                 1.41e+03   3.30e+01   42.82  < 2e-16 ***
body_typecoupe                 -1.99e+03   3.52e+02   -5.66  1.5e-08 ***
body_typehatchback             -3.60e+03   3.36e+02  -10.69  < 2e-16 ***
body_typeliftback              -2.92e+03   3.59e+02   -8.14  4.1e-16 ***
body_typelimousine             -1.56e+03   9.66e+02   -1.61  0.10638    
body_typeminibus               -8.05e+02   3.45e+02   -2.33  0.01970 *  
body_typeminivan               -2.79e+03   3.39e+02   -8.24  < 2e-16 ***
body_typepickup                -1.39e+03   4.23e+02   -3.28  0.00104 ** 
body_typesedan                 -3.83e+03   3.35e+02  -11.43  < 2e-16 ***
body_typesuv                   -1.89e+03   3.40e+02   -5.55  2.9e-08 ***
body_typeuniversal             -3.61e+03   3.37e+02  -10.69  < 2e-16 ***
body_typevan                   -2.56e+03   3.53e+02   -7.24  4.5e-13 ***
has_warrantyTrue               -6.43e+01   2.62e+02   -0.25  0.80600    
statenew                        8.03e+03   3.07e+02   26.19  < 2e-16 ***
stateowned                      1.32e+03   1.52e+02    8.69  < 2e-16 ***
drivetrainfront                -2.13e+03   7.12e+01  -29.95  < 2e-16 ***
drivetrainrear                 -2.35e+03   8.37e+01  -28.06  < 2e-16 ***
is_exchangeableTrue            -1.47e+02   3.17e+01   -4.63  3.6e-06 ***
number_of_photos                5.40e+01   2.57e+00   21.06  < 2e-16 ***
up_counter                      1.77e+00   3.45e-01    5.12  3.0e-07 ***
poly(year_produced, 2)1         7.53e+05   4.29e+03  175.49  < 2e-16 ***
poly(year_produced, 2)2         3.93e+05   3.67e+03  107.08  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2880 on 38427 degrees of freedom
Multiple R-squared:  0.799, Adjusted R-squared:  0.799 
F-statistic: 1.65e+03 on 93 and 38427 DF,  p-value: <2e-16
#poly(year_produced,2)

print(vif(fit2))
                        GVIF Df GVIF^(1/(2*Df))
manufacturer_name      24.43 54            1.03
color                   1.58 11            1.02
transmission            1.83  1            1.35
odometer_value          1.75  1            1.32
engine_fuel             1.64  4            1.06
engine_capacity         2.27  1            1.51
body_type               8.89 11            1.10
has_warranty            3.66  1            1.91
state                   3.76  2            1.39
drivetrain              6.19  2            1.58
is_exchangeable         1.06  1            1.03
number_of_photos        1.13  1            1.06
up_counter              1.03  1            1.02
poly(year_produced, 2)  3.48  2            1.37
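
For a term with a single degree of freedom, the GVIF reported by vif() reduces to the ordinary variance inflation factor, 1/(1 - R^2), where R^2 comes from regressing that predictor on the remaining ones. The sketch below reproduces that calculation by hand in base R. It runs on simulated data; the names x1, x2, x3 and manual_vif are hypothetical and are not columns or functions from the car data set or the car package.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)   # deliberately correlated with x1
x3 <- rnorm(n)                        # independent of x1 and x2
y  <- 1 + 2 * x1 - x2 + 0.5 * x3 + rnorm(n)
sim <- data.frame(y, x1, x2, x3)

# VIF of one predictor: regress it on the other predictors, then 1 / (1 - R^2)
manual_vif <- function(target, data, predictors) {
  others <- setdiff(predictors, target)
  r2 <- summary(lm(reformulate(others, response = target), data = data))$r.squared
  1 / (1 - r2)
}

predictors <- c("x1", "x2", "x3")
vifs <- sapply(predictors, manual_vif, data = sim, predictors = predictors)
print(round(vifs, 2))
```

In this simulation the correlated pair x1 and x2 should show clearly inflated values, while x3 stays close to 1.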

We can see a significant increase in adjusted R-squared and a lower residual standard error. Note: ANOVA is to be performed on all models.
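
The model comparison mentioned in the note could look roughly like the following: anova() applied to two nested lm fits gives an F-test for the extra terms. This is a minimal sketch on simulated data, not the comparison actually run for this report; sim, fit_lin and fit_quad are hypothetical names, and the simulated price curve is illustrative only.

```r
set.seed(42)
n    <- 400
year <- sample(1990:2019, n, replace = TRUE)
# Simulated price with curvature in year (illustrative values only)
price <- 2000 + 30 * (year - 1990)^2 + rnorm(n, sd = 1500)
sim <- data.frame(price, year)

fit_lin  <- lm(price ~ year, data = sim)            # linear in year
fit_quad <- lm(price ~ poly(year, 2), data = sim)   # adds the 2nd-degree term

# F-test: does the quadratic term improve on the linear model?
cmp <- anova(fit_lin, fit_quad)
print(cmp)

adj_r2 <- c(linear    = summary(fit_lin)$adj.r.squared,
            quadratic = summary(fit_quad)$adj.r.squared)
print(round(adj_r2, 3))
```

The same call pattern extends to the fits in this report, provided each pair of models is nested and fitted on the same rows.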

plot(df$odometer_value, df$price, main="odometer_value vs price", pch=19)

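Note: to model an inverse term such as 1/x or 1/x^2 in lm, wrap it in I(), e.g. I(1/x^2); otherwise operators like / and ^ keep their formula meaning rather than their arithmetic one.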

fit3 <- lm(price ~ manufacturer_name+color+transmission+engine_fuel+engine_capacity+body_type+has_warranty+state+drivetrain+is_exchangeable+number_of_photos+state+up_counter+poly(year_produced,2)+(1/odometer_value), data = df)
summary(fit2)

library(car)
print(vif(fit1))
                   GVIF Df GVIF^(1/(2*Df))
manufacturer_name 20.43 54            1.03
color              1.53 11            1.02
transmission       1.83  1            1.35
odometer_value     1.64  1            1.28
engine_fuel        1.64  4            1.06
engine_capacity    2.21  1            1.49
body_type          8.71 11            1.10
has_warranty       3.65  1            1.91
state              3.73  2            1.39
drivetrain         6.17  2            1.58
is_exchangeable    1.06  1            1.03
number_of_photos   1.13  1            1.06
up_counter         1.03  1            1.02
year_produced      2.14  1            1.46

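No effect of (1/odometer_value) is observed. Note that summary(fit2), not summary(fit3), is printed above, so this run does not actually test the inverse term.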
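
To check an inverse-odometer effect, the term would need to enter the formula as I(1/odometer_value). The sketch below shows the idea on simulated data: it compares a linear term with an inverse term. The names odo, sim, fit_lin and fit_inv are hypothetical, and the simulated values are strictly positive. That matters because a value of 0 would turn into Inf and lm would stop with an error; any zero readings in the real data would have to be handled before trying this.

```r
set.seed(7)
n   <- 300
odo <- runif(n, 1000, 300000)                 # strictly positive, so 1/odo stays finite
price <- 5000 + 2e7 / odo + rnorm(n, sd = 500) # simulated inverse relationship
sim <- data.frame(price, odo)

fit_lin <- lm(price ~ odo, data = sim)        # linear odometer term
fit_inv <- lm(price ~ I(1/odo), data = sim)   # I() makes "/" arithmetic division

print(names(coef(fit_inv)))
adj_r2 <- c(linear  = summary(fit_lin)$adj.r.squared,
            inverse = summary(fit_inv)$adj.r.squared)
print(round(adj_r2, 3))
```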

Checking for an interaction between poly(year_produced, 2) and odometer_value, and removing color.

fit4 <- lm(price ~ manufacturer_name+transmission+engine_fuel+engine_capacity+body_type+has_warranty+state+drivetrain+is_exchangeable+number_of_photos+state+up_counter+poly(year_produced,2)*odometer_value, data = df)
summary(fit4)

Call:
lm(formula = price ~ manufacturer_name + transmission + engine_fuel + 
    engine_capacity + body_type + has_warranty + state + drivetrain + 
    is_exchangeable + number_of_photos + state + up_counter + 
    poly(year_produced, 2) * odometer_value, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-21253  -1321    -25   1120  30231 

Coefficients:
                                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)                             1.01e+04   5.27e+02   19.17  < 2e-16 ***
manufacturer_nameAlfa Romeo            -1.39e+03   4.07e+02   -3.41  0.00064 ***
manufacturer_nameAudi                  -1.04e+02   3.59e+02   -0.29  0.77226    
manufacturer_nameBMW                    9.39e+02   3.61e+02    2.60  0.00922 ** 
manufacturer_nameBuick                 -4.80e+03   5.50e+02   -8.73  < 2e-16 ***
manufacturer_nameCadillac              -3.11e+03   5.65e+02   -5.50  3.9e-08 ***
manufacturer_nameChery                 -7.25e+03   5.18e+02  -13.98  < 2e-16 ***
manufacturer_nameChevrolet             -4.02e+03   3.80e+02  -10.58  < 2e-16 ***
manufacturer_nameChrysler              -2.47e+03   3.82e+02   -6.47  9.8e-11 ***
manufacturer_nameCitroen               -2.32e+03   3.63e+02   -6.38  1.8e-10 ***
manufacturer_nameDacia                 -4.53e+03   5.16e+02   -8.77  < 2e-16 ***
manufacturer_nameDaewoo                -2.88e+03   4.05e+02   -7.12  1.1e-12 ***
manufacturer_nameDodge                 -2.76e+03   3.92e+02   -7.04  2.0e-12 ***
manufacturer_nameFiat                  -2.00e+03   3.70e+02   -5.41  6.5e-08 ***
manufacturer_nameFord                  -2.20e+03   3.60e+02   -6.12  9.4e-10 ***
manufacturer_nameGeely                 -7.95e+03   4.93e+02  -16.12  < 2e-16 ***
manufacturer_nameGreat Wall            -7.14e+03   5.96e+02  -11.98  < 2e-16 ***
manufacturer_nameHonda                 -1.11e+03   3.69e+02   -3.01  0.00259 ** 
manufacturer_nameHyundai               -2.55e+03   3.65e+02   -6.98  3.0e-12 ***
manufacturer_nameInfiniti              -1.25e+03   4.19e+02   -2.99  0.00282 ** 
manufacturer_nameIveco                  9.94e+02   4.40e+02    2.26  0.02395 *  
manufacturer_nameJaguar                 4.64e+03   5.31e+02    8.74  < 2e-16 ***
manufacturer_nameJeep                  -3.07e+03   4.50e+02   -6.82  9.0e-12 ***
manufacturer_nameKia                   -2.95e+03   3.67e+02   -8.04  9.6e-16 ***
manufacturer_nameLADA                  -8.38e+03   4.30e+02  -19.50  < 2e-16 ***
manufacturer_nameLancia                -1.72e+03   4.64e+02   -3.69  0.00022 ***
manufacturer_nameLand Rover             1.34e+02   4.13e+02    0.32  0.74608    
manufacturer_nameLexus                  3.25e+03   4.06e+02    8.00  1.3e-15 ***
manufacturer_nameLifan                 -8.23e+03   5.51e+02  -14.94  < 2e-16 ***
manufacturer_nameLincoln               -2.16e+03   6.28e+02   -3.44  0.00058 ***
manufacturer_nameMazda                 -1.63e+03   3.63e+02   -4.48  7.6e-06 ***
manufacturer_nameMercedes-Benz          5.40e+02   3.61e+02    1.50  0.13461    
manufacturer_nameMini                   2.60e+02   4.98e+02    0.52  0.60103    
manufacturer_nameMitsubishi            -2.03e+03   3.67e+02   -5.52  3.4e-08 ***
manufacturer_nameNissan                -2.25e+03   3.63e+02   -6.19  5.9e-10 ***
manufacturer_nameOpel                  -1.91e+03   3.60e+02   -5.30  1.1e-07 ***
manufacturer_namePeugeot               -2.11e+03   3.61e+02   -5.84  5.4e-09 ***
manufacturer_namePontiac               -2.07e+03   5.67e+02   -3.65  0.00026 ***
manufacturer_namePorsche                2.06e+03   5.10e+02    4.04  5.4e-05 ***
manufacturer_nameRenault               -2.68e+03   3.60e+02   -7.45  9.5e-14 ***
manufacturer_nameRover                 -1.32e+03   4.02e+02   -3.28  0.00106 ** 
manufacturer_nameSaab                  -1.48e+03   4.49e+02   -3.30  0.00096 ***
manufacturer_nameSeat                  -1.26e+03   3.92e+02   -3.21  0.00134 ** 
manufacturer_nameSkoda                 -2.60e+03   3.72e+02   -6.98  3.0e-12 ***
manufacturer_nameSsangYong             -5.27e+03   4.80e+02  -10.96  < 2e-16 ***
manufacturer_nameSubaru                -2.95e+03   3.94e+02   -7.47  8.3e-14 ***
manufacturer_nameSuzuki                -2.90e+03   4.02e+02   -7.23  5.0e-13 ***
manufacturer_nameToyota                 5.77e+01   3.63e+02    0.16  0.87390    
manufacturer_nameVolkswagen            -9.74e+02   3.58e+02   -2.72  0.00650 ** 
manufacturer_nameVolvo                 -5.51e+02   3.70e+02   -1.49  0.13585    
manufacturer_nameВАЗ                   -2.47e+03   3.81e+02   -6.47  9.6e-11 ***
manufacturer_nameГАЗ                   -5.46e+03   4.22e+02  -12.94  < 2e-16 ***
manufacturer_nameЗАЗ                   -4.05e+03   5.70e+02   -7.12  1.1e-12 ***
manufacturer_nameМосквич               -3.01e+03   5.41e+02   -5.55  2.8e-08 ***
manufacturer_nameУАЗ                   -7.37e+03   4.89e+02  -15.08  < 2e-16 ***
transmissionmechanical                 -5.33e+02   4.19e+01  -12.73  < 2e-16 ***
engine_fuelgas                         -1.18e+03   8.48e+01  -13.92  < 2e-16 ***
engine_fuelgasoline                    -9.74e+02   3.78e+01  -25.81  < 2e-16 ***
engine_fuelhybrid-diesel                1.92e+03   2.03e+03    0.94  0.34550    
engine_fuelhybrid-petrol               -7.33e+02   2.00e+02   -3.66  0.00026 ***
engine_capacity                         1.46e+03   3.27e+01   44.62  < 2e-16 ***
body_typecoupe                         -1.97e+03   3.50e+02   -5.62  1.9e-08 ***
body_typehatchback                     -3.57e+03   3.35e+02  -10.66  < 2e-16 ***
body_typeliftback                      -2.96e+03   3.57e+02   -8.29  < 2e-16 ***
body_typelimousine                     -1.59e+03   9.60e+02   -1.65  0.09799 .  
body_typeminibus                       -7.01e+02   3.43e+02   -2.04  0.04106 *  
body_typeminivan                       -2.70e+03   3.37e+02   -8.03  1.0e-15 ***
body_typepickup                        -1.35e+03   4.21e+02   -3.20  0.00136 ** 
body_typesedan                         -3.84e+03   3.33e+02  -11.52  < 2e-16 ***
body_typesuv                           -1.94e+03   3.38e+02   -5.74  9.5e-09 ***
body_typeuniversal                     -3.55e+03   3.35e+02  -10.59  < 2e-16 ***
body_typevan                           -2.47e+03   3.51e+02   -7.05  1.8e-12 ***
has_warrantyTrue                       -4.59e+02   2.61e+02   -1.76  0.07849 .  
statenew                                7.12e+03   3.09e+02   23.04  < 2e-16 ***
stateowned                              1.36e+03   1.51e+02    9.04  < 2e-16 ***
drivetrainfront                        -2.19e+03   7.08e+01  -30.95  < 2e-16 ***
drivetrainrear                         -2.40e+03   8.33e+01  -28.77  < 2e-16 ***
is_exchangeableTrue                    -1.51e+02   3.15e+01   -4.78  1.8e-06 ***
number_of_photos                        5.48e+01   2.55e+00   21.50  < 2e-16 ***
up_counter                              1.80e+00   3.43e-01    5.26  1.4e-07 ***
poly(year_produced, 2)1                 8.54e+05   6.16e+03  138.71  < 2e-16 ***
poly(year_produced, 2)2                 3.95e+05   4.80e+03   82.22  < 2e-16 ***
odometer_value                         -4.16e-03   1.71e-04  -24.38  < 2e-16 ***
poly(year_produced, 2)1:odometer_value -6.21e-01   2.74e-02  -22.69  < 2e-16 ***
poly(year_produced, 2)2:odometer_value -1.89e-01   2.30e-02   -8.21  2.3e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2870 on 38436 degrees of freedom
Multiple R-squared:  0.801, Adjusted R-squared:  0.801 
F-statistic: 1.85e+03 on 84 and 38436 DF,  p-value: <2e-16
library(car)
print(vif(fit4))
                                       GVIF Df GVIF^(1/(2*Df))
manufacturer_name                     25.46 54            1.03
transmission                           1.83  1            1.35
engine_fuel                            1.68  4            1.07
engine_capacity                        2.26  1            1.50
body_type                              8.18 11            1.10
has_warranty                           3.68  1            1.92
state                                  3.90  2            1.41
drivetrain                             6.18  2            1.58
is_exchangeable                        1.06  1            1.03
number_of_photos                       1.13  1            1.06
up_counter                             1.03  1            1.02
poly(year_produced, 2)                12.69  2            1.89
odometer_value                         2.53  1            1.59
poly(year_produced, 2):odometer_value 11.15  2            1.83

We observe a slight increase in the adjusted R-squared (from 0.799 to 0.801).
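
As a reminder of what the * operator does in an lm formula, a*b expands to both main effects plus their interaction, so a 2nd-degree polynomial crossed with one numeric variable yields two interaction coefficients. The sketch below illustrates this on simulated data; sim, year, odo, fit_main and fit_int are hypothetical names rather than objects from this analysis.

```r
set.seed(3)
n    <- 300
year <- sample(1990:2019, n, replace = TRUE)
odo  <- runif(n, 0, 300000)
price <- 3000 + 200 * (year - 1990) - 0.01 * odo + rnorm(n, sd = 800)
sim <- data.frame(price, year, odo)

fit_main <- lm(price ~ poly(year, 2) + odo, data = sim)  # main effects only
fit_int  <- lm(price ~ poly(year, 2) * odo, data = sim)  # main effects + interaction

# Terms generated by "*": both main effects and the interaction
labels_int <- attr(terms(fit_int), "term.labels")
print(labels_int)
print(names(coef(fit_int)))

# Nested comparison: do the interaction coefficients add anything?
print(anova(fit_main, fit_int))
```

Crossing three terms with *, as in the next model, generates every two-way interaction plus the three-way one in the same manner.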

Adding number_of_photos to the interaction with poly(year_produced, 2) and odometer_value

fit5 <- lm(price ~ manufacturer_name+transmission+engine_fuel+engine_capacity+body_type+has_warranty+state+drivetrain+is_exchangeable+state+up_counter+poly(year_produced,2)*odometer_value*number_of_photos, data = df)
summary(fit5)

Call:
lm(formula = price ~ manufacturer_name + transmission + engine_fuel + 
    engine_capacity + body_type + has_warranty + state + drivetrain + 
    is_exchangeable + state + up_counter + poly(year_produced, 
    2) * odometer_value * number_of_photos, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-18543  -1283     -8   1095  29201 

Coefficients:
                                                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)                                              9.78e+03   5.22e+02   18.73  < 2e-16 ***
manufacturer_nameAlfa Romeo                             -1.40e+03   4.01e+02   -3.49  0.00048 ***
manufacturer_nameAudi                                   -7.37e+01   3.54e+02   -0.21  0.83495    
manufacturer_nameBMW                                     9.14e+02   3.55e+02    2.58  0.01002 *  
manufacturer_nameBuick                                  -5.31e+03   5.42e+02   -9.80  < 2e-16 ***
manufacturer_nameCadillac                               -3.11e+03   5.57e+02   -5.58  2.4e-08 ***
manufacturer_nameChery                                  -7.11e+03   5.10e+02  -13.92  < 2e-16 ***
manufacturer_nameChevrolet                              -3.95e+03   3.74e+02  -10.54  < 2e-16 ***
manufacturer_nameChrysler                               -2.49e+03   3.76e+02   -6.60  4.1e-11 ***
manufacturer_nameCitroen                                -2.32e+03   3.57e+02   -6.49  8.9e-11 ***
manufacturer_nameDacia                                  -4.42e+03   5.08e+02   -8.70  < 2e-16 ***
manufacturer_nameDaewoo                                 -2.85e+03   3.99e+02   -7.14  9.7e-13 ***
manufacturer_nameDodge                                  -2.78e+03   3.86e+02   -7.21  5.9e-13 ***
manufacturer_nameFiat                                   -2.06e+03   3.64e+02   -5.66  1.6e-08 ***
manufacturer_nameFord                                   -2.26e+03   3.54e+02   -6.38  1.8e-10 ***
manufacturer_nameGeely                                  -7.78e+03   4.85e+02  -16.03  < 2e-16 ***
manufacturer_nameGreat Wall                             -7.12e+03   5.87e+02  -12.13  < 2e-16 ***
manufacturer_nameHonda                                  -1.08e+03   3.64e+02   -2.97  0.00294 ** 
manufacturer_nameHyundai                                -2.44e+03   3.59e+02   -6.80  1.0e-11 ***
manufacturer_nameInfiniti                               -1.29e+03   4.13e+02   -3.14  0.00171 ** 
manufacturer_nameIveco                                   9.13e+02   4.33e+02    2.11  0.03504 *  
manufacturer_nameJaguar                                  4.29e+03   5.23e+02    8.21  2.3e-16 ***
manufacturer_nameJeep                                   -3.03e+03   4.43e+02   -6.84  8.1e-12 ***
manufacturer_nameKia                                    -2.90e+03   3.61e+02   -8.02  1.1e-15 ***
manufacturer_nameLADA                                   -8.00e+03   4.23e+02  -18.89  < 2e-16 ***
manufacturer_nameLancia                                 -1.70e+03   4.57e+02   -3.73  0.00019 ***
manufacturer_nameLand Rover                              2.17e+01   4.07e+02    0.05  0.95744    
manufacturer_nameLexus                                   3.19e+03   3.99e+02    8.00  1.3e-15 ***
manufacturer_nameLifan                                  -8.03e+03   5.42e+02  -14.80  < 2e-16 ***
manufacturer_nameLincoln                                -2.35e+03   6.19e+02   -3.79  0.00015 ***
manufacturer_nameMazda                                  -1.63e+03   3.58e+02   -4.55  5.3e-06 ***
manufacturer_nameMercedes-Benz                           5.14e+02   3.55e+02    1.45  0.14779    
manufacturer_nameMini                                   -1.57e+02   4.91e+02   -0.32  0.74929    
manufacturer_nameMitsubishi                             -2.05e+03   3.62e+02   -5.66  1.5e-08 ***
manufacturer_nameNissan                                 -2.20e+03   3.57e+02   -6.15  7.8e-10 ***
manufacturer_nameOpel                                   -1.89e+03   3.54e+02   -5.34  9.3e-08 ***
manufacturer_namePeugeot                                -2.09e+03   3.56e+02   -5.89  4.0e-09 ***
manufacturer_namePontiac                                -2.09e+03   5.58e+02   -3.74  0.00018 ***
manufacturer_namePorsche                                 2.04e+03   5.03e+02    4.06  4.9e-05 ***
manufacturer_nameRenault                                -2.65e+03   3.55e+02   -7.46  8.9e-14 ***
manufacturer_nameRover                                  -1.33e+03   3.96e+02   -3.36  0.00077 ***
manufacturer_nameSaab                                   -1.44e+03   4.42e+02   -3.27  0.00109 ** 
manufacturer_nameSeat                                   -1.30e+03   3.86e+02   -3.37  0.00077 ***
manufacturer_nameSkoda                                  -2.67e+03   3.67e+02   -7.28  3.3e-13 ***
manufacturer_nameSsangYong                              -5.21e+03   4.73e+02  -11.01  < 2e-16 ***
manufacturer_nameSubaru                                 -2.88e+03   3.88e+02   -7.41  1.3e-13 ***
manufacturer_nameSuzuki                                 -2.82e+03   3.95e+02   -7.14  9.5e-13 ***
manufacturer_nameToyota                                  7.22e+01   3.58e+02    0.20  0.84003    
manufacturer_nameVolkswagen                             -9.53e+02   3.53e+02   -2.70  0.00685 ** 
manufacturer_nameVolvo                                  -5.50e+02   3.64e+02   -1.51  0.13058    
manufacturer_nameВАЗ                                    -2.48e+03   3.75e+02   -6.61  3.8e-11 ***
manufacturer_nameГАЗ                                    -5.28e+03   4.16e+02  -12.70  < 2e-16 ***
manufacturer_nameЗАЗ                                    -4.10e+03   5.61e+02   -7.30  3.0e-13 ***
manufacturer_nameМосквич                                -2.52e+03   5.34e+02   -4.73  2.3e-06 ***
manufacturer_nameУАЗ                                    -7.33e+03   4.81e+02  -15.22  < 2e-16 ***
transmissionmechanical                                  -5.12e+02   4.12e+01  -12.40  < 2e-16 ***
engine_fuelgas                                          -1.16e+03   8.35e+01  -13.88  < 2e-16 ***
engine_fuelgasoline                                     -9.37e+02   3.72e+01  -25.17  < 2e-16 ***
engine_fuelhybrid-diesel                                 2.18e+03   2.00e+03    1.09  0.27519    
engine_fuelhybrid-petrol                                -7.09e+02   1.97e+02   -3.60  0.00032 ***
engine_capacity                                          1.50e+03   3.23e+01   46.63  < 2e-16 ***
body_typecoupe                                          -2.02e+03   3.45e+02   -5.86  4.7e-09 ***
body_typehatchback                                      -3.60e+03   3.30e+02  -10.91  < 2e-16 ***
body_typeliftback                                       -2.99e+03   3.52e+02   -8.49  < 2e-16 ***
body_typelimousine                                      -1.44e+03   9.46e+02   -1.52  0.12826    
body_typeminibus                                        -7.81e+02   3.38e+02   -2.31  0.02075 *  
body_typeminivan                                        -2.74e+03   3.32e+02   -8.26  < 2e-16 ***
body_typepickup                                         -1.29e+03   4.14e+02   -3.12  0.00178 ** 
body_typesedan                                          -3.87e+03   3.28e+02  -11.81  < 2e-16 ***
body_typesuv                                            -1.97e+03   3.33e+02   -5.92  3.2e-09 ***
body_typeuniversal                                      -3.59e+03   3.30e+02  -10.88  < 2e-16 ***
body_typevan                                            -2.60e+03   3.45e+02   -7.53  5.2e-14 ***
has_warrantyTrue                                        -8.33e+02   2.58e+02   -3.23  0.00124 ** 
statenew                                                 6.92e+03   3.06e+02   22.66  < 2e-16 ***
stateowned                                               1.50e+03   1.49e+02   10.07  < 2e-16 ***
drivetrainfront                                         -2.13e+03   6.98e+01  -30.56  < 2e-16 ***
drivetrainrear                                          -2.31e+03   8.21e+01  -28.10  < 2e-16 ***
is_exchangeableTrue                                     -2.24e+02   3.11e+01   -7.20  6.3e-13 ***
up_counter                                               1.95e+00   3.38e-01    5.77  8.0e-09 ***
poly(year_produced, 2)1                                  6.73e+05   9.78e+03   68.82  < 2e-16 ***
poly(year_produced, 2)2                                  2.91e+05   7.49e+03   38.85  < 2e-16 ***
odometer_value                                          -3.05e-03   2.91e-04  -10.51  < 2e-16 ***
number_of_photos                                         4.11e+01   6.68e+00    6.16  7.5e-10 ***
poly(year_produced, 2)1:odometer_value                  -2.13e-01   4.66e-02   -4.58  4.6e-06 ***
poly(year_produced, 2)2:odometer_value                  -9.76e-02   3.80e-02   -2.57  0.01030 *  
poly(year_produced, 2)1:number_of_photos                 1.73e+04   8.47e+02   20.44  < 2e-16 ***
poly(year_produced, 2)2:number_of_photos                 9.43e+03   6.74e+02   13.99  < 2e-16 ***
odometer_value:number_of_photos                         -5.79e-05   2.62e-05   -2.21  0.02734 *  
poly(year_produced, 2)1:odometer_value:number_of_photos -3.52e-02   3.99e-03   -8.80  < 2e-16 ***
poly(year_produced, 2)2:odometer_value:number_of_photos -3.20e-03   3.72e-03   -0.86  0.38875    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2820 on 38431 degrees of freedom
Multiple R-squared:  0.808, Adjusted R-squared:  0.807 
F-statistic: 1.81e+03 on 89 and 38431 DF,  p-value: <2e-16
library(car)
print(vif(fit5))
                                                        GVIF Df GVIF^(1/(2*Df))
manufacturer_name                                      26.40 54            1.03
transmission                                            1.83  1            1.35
engine_fuel                                             1.68  4            1.07
engine_capacity                                         2.27  1            1.51
body_type                                               8.22 11            1.10
has_warranty                                            3.71  1            1.93
state                                                   3.96  2            1.41
drivetrain                                              6.19  2            1.58
is_exchangeable                                         1.07  1            1.03
up_counter                                              1.03  1            1.02
poly(year_produced, 2)                                 80.26  2            2.99
odometer_value                                          7.55  1            2.75
number_of_photos                                        8.01  1            2.83
poly(year_produced, 2):odometer_value                  91.97  2            3.10
poly(year_produced, 2):number_of_photos                83.49  2            3.02
odometer_value:number_of_photos                        11.84  1            3.44
poly(year_produced, 2):odometer_value:number_of_photos 78.97  2            2.98

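The new interaction terms have very high GVIFs (roughly 79 to 92 for the interactions involving poly(year_produced, 2)), which indicates strong multicollinearity.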
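
As a rough illustration of what a variance inflation factor measures, the sketch below computes VIFs by hand on simulated data. The data and the names x1, x2, x3 and vif_by_hand are assumptions made for this example and are not part of the car data set. For a single-column predictor the VIF is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others.

```r
# Simulated data (hypothetical, not the car data): x2 is nearly a copy of x1.
set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # strongly correlated with x1
x3 <- rnorm(n)                  # independent of the other two
X  <- data.frame(x1, x2, x3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing x_j on the rest.
vif_by_hand <- sapply(names(X), function(v) {
  form <- reformulate(setdiff(names(X), v), response = v)
  r2   <- summary(lm(form, data = X))$r.squared
  1 / (1 - r2)
})

print(round(vif_by_hand, 2))
```

In this sketch the two nearly collinear predictors get large VIFs while the independent one stays close to 1, which is the same pattern the interaction terms show in the table above.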

Passing a t test or an F test only confirms that the linear relationship between x and y is significant, i.e. that the regression model is valid. It does not guarantee that the fit is good enough, and it cannot rule out unreliable data caused by other factors such as outliers.

y outlier check

# install.packages("olsrr")
library(olsrr)
ols_plot_resid_stud_fit(fit3, print_plot = TRUE)

x outlier check

plot(cooks.distance(fit3))  # Cook's distance for each observation in fit3

Since all the Cook's distances are well below the threshold of 1, influential X outliers are unlikely.
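
The two outlier checks can also be expressed as explicit cutoffs rather than read off a plot. The sketch below is a minimal illustration using the built-in mtcars data as a stand-in for the car data; the model fit_demo and the cutoff of 3 for studentized residuals are assumptions made for this example, while the threshold of 1 for Cook's distance is the one used in the text.

```r
# mtcars ships with base R; it stands in for the car data here.
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

stud_res <- rstudent(fit_demo)        # studentized residuals (Y outliers)
cooks_d  <- cooks.distance(fit_demo)  # influence of each observation

# Flag observations beyond the chosen cutoffs.
y_outliers  <- names(stud_res)[abs(stud_res) > 3]  # assumed cutoff of 3
influential <- names(cooks_d)[cooks_d > 1]         # threshold of 1 from the text

print(y_outliers)
print(influential)
print(round(max(cooks_d), 3))
```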

multicollinearity removal

# install.packages("pls")
library(pls)
library(tidyverse)
df_mul = df[, !(names(df) %in% c("price"))]
df_mul %>% select_if(is.numeric)->cars_numerical
cars_numerical = scale(cars_numerical)
df_mul %>% select_if(negate(is.numeric))->cars_cat
df3 = cbind(cars_numerical, cars_cat)



pls1 = plsr(price_normal ~ odometer_value*poly(year_produced, 2)+sin(engine_capacity)+poly(number_of_photos, 2)+up_counter+manufacturer_name+color+transmission+engine_fuel+body_type+has_warranty+state+drivetrain+is_exchangeable, data = df3, validation = "CV")
summary(pls1, what = "all")
Data:   X dimension: 38521 97 
    Y dimension: 38521 1
Fit method: kernelpls
Number of components considered: 97

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
CV               1   0.7864    0.723   0.6866   0.6674   0.6576   0.6464   0.6406   0.6338   0.6298    0.6273    0.6252    0.6233    0.6217    0.6203    0.6193    0.6183    0.6170    0.6152    0.6137    0.6121    0.6106    0.6093    0.6072    0.6046    0.5992    0.5923    0.5846    0.5751    0.5561
adjCV            1   0.7864    0.723   0.6865   0.6658   0.6514   0.6469   0.6405   0.6338   0.6298    0.6272    0.6251    0.6232    0.6216    0.6202    0.6193    0.6182    0.6169    0.6151    0.6136    0.6120    0.6105    0.6092    0.6072    0.6046    0.5992    0.5923    0.5846    0.5750    0.5557
       30 comps  31 comps  32 comps  33 comps  34 comps  35 comps  36 comps  37 comps  38 comps  39 comps  40 comps  41 comps  42 comps  43 comps  44 comps  45 comps  46 comps  47 comps  48 comps  49 comps  50 comps  51 comps  52 comps  53 comps  54 comps  55 comps  56 comps  57 comps  58 comps
CV       0.5272    0.5094    0.4943    0.4724    0.4601    0.4471    0.4367    0.4314    0.4267    0.4241    0.4226    0.4213    0.4203    0.4197    0.4189    0.4182    0.4171    0.4157    0.4140    0.4111    0.4091     0.406    0.4029    0.4006    0.3981    0.3960    0.3938    0.3878    0.3825
adjCV    0.5266    0.5089    0.4938    0.4722    0.4597    0.4466    0.4364    0.4315    0.4264    0.4237    0.4225    0.4212    0.4202    0.4196    0.4188    0.4181    0.4170    0.4156    0.4138    0.4109    0.4088     0.406    0.4031    0.4011    0.3989    0.3968    0.3949    0.3880    0.3829
       59 comps  60 comps  61 comps  62 comps  63 comps  64 comps  65 comps  66 comps  67 comps  68 comps  69 comps  70 comps  71 comps  72 comps  73 comps  74 comps  75 comps  76 comps  77 comps  78 comps  79 comps  80 comps  81 comps  82 comps  83 comps  84 comps  85 comps  86 comps  87 comps
CV       0.3766    0.3721    0.3702    0.3683    0.3667    0.3651    0.3641    0.3625    0.3608    0.3597    0.3591    0.3588    0.3584    0.3559    0.3529    0.3476    0.3460    0.3452    0.3449    0.3443    0.3434    0.3419    0.3407    0.3404    0.3404    0.3404    0.3403    0.3402    0.3402
adjCV    0.3750    0.3710    0.3693    0.3679    0.3665    0.3648    0.3639    0.3623    0.3605    0.3595    0.3590    0.3588    0.3585    0.3560    0.3522    0.3466    0.3455    0.3450    0.3448    0.3445    0.3438    0.3422    0.3405    0.3403    0.3403    0.3403    0.3403    0.3402    0.3401
       88 comps  89 comps  90 comps  91 comps  92 comps  93 comps  94 comps  95 comps  96 comps  97 comps
CV       0.3402    0.3402    0.3402    0.3402    0.3402    0.3402    0.3402    0.3402    0.3402    0.3402
adjCV    0.3401    0.3401    0.3401    0.3401    0.3401    0.3401    0.3401    0.3401    0.3401    0.3401

TRAINING: % variance explained
              1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps  16 comps  17 comps  18 comps  19 comps  20 comps  21 comps  22 comps  23 comps  24 comps  25 comps  26 comps  27 comps  28 comps  29 comps
X               15.86    26.38    31.21    34.73    39.53    53.52    56.23    57.60    59.80     62.05     63.58     65.12     66.96     68.92     71.27     72.76     74.07     75.06     76.48     77.82     78.95     79.98     80.73     81.93     82.76     83.62     84.66     85.61     86.21
price_normal    38.19    47.78    52.94    55.77    57.67    58.23    59.10    59.96    60.48     60.81     61.08     61.33     61.53     61.71     61.82     61.95     62.13     62.35     62.52     62.72     62.90     63.06     63.31     63.62     64.29     65.11     66.01     67.16     69.37
              30 comps  31 comps  32 comps  33 comps  34 comps  35 comps  36 comps  37 comps  38 comps  39 comps  40 comps  41 comps  42 comps  43 comps  44 comps  45 comps  46 comps  47 comps  48 comps  49 comps  50 comps  51 comps  52 comps  53 comps  54 comps  55 comps  56 comps  57 comps
X                86.69     87.43     88.14     88.66     89.42     89.81     90.29     91.14     91.55     91.99     92.64     93.10     93.40     93.76     94.01     94.35     94.67     94.92     95.18     95.42     95.72     96.05     96.27     96.57     96.82     97.07     97.35     97.48
price_normal     72.52     74.35     75.86     77.91     79.08     80.24     81.14     81.54     81.96     82.17     82.27     82.38     82.47     82.51     82.58     82.65     82.74     82.85     83.00     83.21     83.42     83.63     83.88     84.03     84.19     84.36     84.55     85.12
              58 comps  59 comps  60 comps  61 comps  62 comps  63 comps  64 comps  65 comps  66 comps  67 comps  68 comps  69 comps  70 comps  71 comps  72 comps  73 comps  74 comps  75 comps  76 comps  77 comps  78 comps  79 comps  80 comps  81 comps  82 comps  83 comps  84 comps  85 comps
X                97.75     97.91     98.06     98.27     98.43     98.52     98.59     98.71     98.79     98.85     98.90     98.96     99.04     99.11     99.16     99.22     99.27     99.34     99.39     99.45     99.50     99.56     99.61     99.66     99.69     99.73     99.80     99.82
price_normal     85.53     86.14     86.44     86.55     86.65     86.73     86.83     86.91     87.03     87.15     87.21     87.24     87.26     87.29     87.46     87.75     88.10     88.17     88.20     88.22     88.24     88.29     88.39     88.48     88.49     88.49     88.49     88.50
              86 comps  87 comps  88 comps  89 comps  90 comps  91 comps  92 comps  93 comps  94 comps  95 comps  96 comps  97 comps
X                99.83     99.86     99.88     99.89     99.90     99.93     99.94     99.95     99.98     99.99    100.00    100.11
price_normal     88.51     88.51     88.51     88.51     88.51     88.51     88.51     88.51     88.51     88.51     88.51     88.51
pls.RMSEP = RMSEP(pls1, estimate="CV")
plot(pls.RMSEP, main="RMSEP PLS Price", xlab="components")
min_comp = which.min(pls.RMSEP$val) - 1  # element 1 is the intercept-only model (0 components)
min(pls.RMSEP$val)
[1] 0.34
points(min_comp, min(pls.RMSEP$val), pch=1, col="red", cex=1.5)

plot(pls1, ncomp = 88, line = TRUE)

To address the multicollinearity, Team PatternSix used partial least squares (PLS) regression. According to Frank and Friedman (1993), PLS requires fewer assumptions than alternatives such as ridge regression and PCR and can yield better results. Cross-validation was performed to calculate the RMSEP. The components are latent factors, and at most 97 components were considered. At around 86 components the RMSEP levels off at its minimum of about 0.34, so the final model below is fit with ncomp = 88, inside that flat region.
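
Choosing the number of components from a cross-validated RMSEP curve can be sketched as follows. The vector rmsep below is made up for illustration and is not taken from the fit above; the tolerance of 0.01 is likewise an arbitrary assumption. The one detail worth showing is that the first element of such a vector belongs to the intercept-only model, so the position of the minimum has to be shifted by one to get a component count.

```r
# RMSEP-style vector; element 1 is the intercept-only model (0 components).
# Values are made up for illustration, not taken from the fit above.
rmsep <- c(1.00, 0.79, 0.72, 0.69, 0.665, 0.661, 0.660, 0.660)

# Number of components at the minimum: subtract 1 for the intercept slot.
ncomp_min <- which.min(rmsep) - 1

# Smallest number of components whose RMSEP is within a tolerance of the
# minimum (the tolerance of 0.01 is an arbitrary assumption).
tol       <- 0.01
ncomp_tol <- min(which(rmsep <= min(rmsep) + tol)) - 1

print(c(ncomp_min = ncomp_min, ncomp_tol = ncomp_tol))
```

A tolerance rule like this tends to pick a smaller model than the strict minimum when the curve is flat, which is the situation in the RMSEP table above.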

pls2 = plsr(price_normal ~ odometer_value*poly(year_produced, 2)+sin(engine_capacity)+poly(number_of_photos, 2)+up_counter+manufacturer_name+color+transmission+engine_fuel+body_type+has_warranty+state+drivetrain+is_exchangeable, data = df3, jackknife = TRUE, validation = "CV", ncomp = 88)
coef(pls2)
, , 88 comps

                                       price_normal
odometer_value                             1.34e-03
poly(year_produced, 2)1                    1.49e+02
poly(year_produced, 2)2                    4.14e+01
sin(engine_capacity)                       1.98e-01
poly(number_of_photos, 2)1                 8.52e+00
poly(number_of_photos, 2)2                -7.70e-01
up_counter                                 1.60e-02
manufacturer_nameAlfa Romeo               -3.92e-01
manufacturer_nameAudi                      6.58e-02
manufacturer_nameBMW                       7.36e-02
manufacturer_nameBuick                    -5.50e-01
manufacturer_nameCadillac                  8.91e-03
manufacturer_nameChery                    -1.10e+00
manufacturer_nameChevrolet                -4.72e-01
manufacturer_nameChrysler                 -3.91e-01
manufacturer_nameCitroen                  -4.01e-01
manufacturer_nameDacia                    -6.69e-01
manufacturer_nameDaewoo                   -8.10e-01
manufacturer_nameDodge                    -4.81e-01
manufacturer_nameFiat                     -4.99e-01
manufacturer_nameFord                     -4.55e-01
manufacturer_nameGeely                    -1.12e+00
manufacturer_nameGreat Wall               -9.52e-01
manufacturer_nameHonda                    -1.39e-01
manufacturer_nameHyundai                  -4.12e-01
manufacturer_nameInfiniti                 -2.76e-02
manufacturer_nameIveco                    -3.44e-02
manufacturer_nameJaguar                    2.34e-01
manufacturer_nameJeep                     -2.32e-01
manufacturer_nameKia                      -4.55e-01
manufacturer_nameLADA                     -9.89e-01
manufacturer_nameLancia                   -4.10e-01
manufacturer_nameLand Rover               -5.35e-02
manufacturer_nameLexus                     2.83e-01
manufacturer_nameLifan                    -1.05e+00
manufacturer_nameLincoln                   1.25e-01
manufacturer_nameMazda                    -3.64e-01
manufacturer_nameMercedes-Benz             4.67e-02
manufacturer_nameMini                      4.39e-02
manufacturer_nameMitsubishi               -3.79e-01
manufacturer_nameNissan                   -3.84e-01
manufacturer_nameOpel                     -3.36e-01
manufacturer_namePeugeot                  -3.20e-01
manufacturer_namePontiac                  -2.43e-01
manufacturer_namePorsche                   3.38e-01
manufacturer_nameRenault                  -4.67e-01
manufacturer_nameRover                    -4.44e-01
manufacturer_nameSaab                     -2.30e-01
manufacturer_nameSeat                     -2.96e-01
manufacturer_nameSkoda                    -3.00e-01
manufacturer_nameSsangYong                -6.29e-01
manufacturer_nameSubaru                   -3.16e-01
manufacturer_nameSuzuki                   -4.39e-01
manufacturer_nameToyota                    5.23e-04
manufacturer_nameVolkswagen               -1.14e-01
manufacturer_nameVolvo                    -1.37e-01
manufacturer_nameВАЗ                      -7.20e-01
manufacturer_nameГАЗ                      -7.24e-01
manufacturer_nameЗАЗ                      -1.05e+00
manufacturer_nameМосквич                  -6.76e-01
manufacturer_nameУАЗ                      -1.08e+00
colorblue                                 -4.81e-02
colorbrown                                -3.43e-02
colorgreen                                -6.62e-02
colorgrey                                 -2.79e-02
colororange                               -2.56e-02
colorother                                -3.54e-02
colorred                                  -8.24e-02
colorsilver                               -3.79e-02
colorviolet                               -6.66e-02
colorwhite                                -9.46e-02
coloryellow                               -2.89e-02
transmissionmechanical                    -1.03e-01
engine_fuelelectric                        0.00e+00
engine_fuelgas                            -1.12e-01
engine_fuelgasoline                       -1.08e-01
engine_fuelhybrid-diesel                   3.88e-01
engine_fuelhybrid-petrol                  -9.32e-02
body_typecoupe                            -2.84e-01
body_typehatchback                        -5.63e-01
body_typeliftback                         -4.85e-01
body_typelimousine                         1.30e-01
body_typeminibus                          -9.74e-02
body_typeminivan                          -2.89e-01
body_typepickup                           -7.02e-03
body_typesedan                            -5.52e-01
body_typesuv                              -2.57e-01
body_typeuniversal                        -5.40e-01
body_typevan                              -2.73e-01
has_warrantyTrue                          -1.04e-01
statenew                                   9.25e-01
stateowned                                 7.29e-01
drivetrainfront                           -1.91e-01
drivetrainrear                            -1.37e-01
is_exchangeableTrue                       -2.54e-02
odometer_value:poly(year_produced, 2)1     7.99e+00
odometer_value:poly(year_produced, 2)2    -1.27e+00

REGULARIZATION

RIDGE

Ridge regression is a method for estimating the coefficients of a multiple-regression model when the independent variables are highly correlated. It adds an L2 penalty that shrinks the coefficients and stabilizes the estimates.
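
To make the shrinkage concrete, the sketch below solves ridge regression in closed form on simulated, nearly collinear data. The data and the helper ridge_coef are assumptions made for this example. The penalty here is on the raw sum-of-squares scale, so the lambda values are not directly comparable to those used by glmnet.

```r
# Simulated data (hypothetical): x2 is nearly collinear with x1.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
y  <- 2 * x1 + 2 * x2 + rnorm(n)

X  <- scale(cbind(x1, x2))   # standardize the predictors
yc <- y - mean(y)            # center the response, so no intercept is needed

# Hypothetical helper: closed-form ridge solution (X'X + lambda * I)^(-1) X'y.
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

beta_ols   <- ridge_coef(X, yc, lambda = 0)    # ordinary least squares
beta_ridge <- ridge_coef(X, yc, lambda = 10)   # shrunken coefficients

print(round(rbind(ols = beta_ols, ridge = beta_ridge), 3))
```

With collinear predictors the least-squares coefficients can be unstable, while the penalized solution always has a smaller overall norm.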

Initialize the independent variables (x) and the dependent variable (y) that are used to build the train and test data from the dataset.

library("ISLR")
df_reg = uzscale(df, append=0, "price_normal")
x=model.matrix(price_normal~.,df_reg)[,-1]
y=df_reg$price_normal

Prepare the train and test data for the independent variables (x) and the dependent variable (y).

library("dplyr")
set.seed(1)
train = df_reg %>% sample_frac(0.75)
test = df_reg %>% setdiff(train)

x_train = model.matrix(price_normal~., train)[,-1]
x_test = model.matrix(price_normal~., test)[,-1]

y_train = train$price_normal %>% unlist()
y_test = test$price_normal %>% unlist()
# y_train = train %>% select(price) %>% unlist() # %>% as.numeric()
# y_test = test %>% select(price) %>% unlist() # %>% as.numeric()

Building Ridge Model

library("glmnet")
grid=10^seq(10,-2,length=100)
ridge.mod=glmnet(x_train,y_train,alpha=0,lambda=grid)  # the lambda sequence is passed via the lambda argument
plot(ridge.mod)

Using cross-validated grid search to find the optimal lambda value

# set.seed(1)
# ridge_cv=cv.glmnet(x_train,y_train,alpha=0, standardize = TRUE, nfolds = 10)  # Fit ridge regression model on training data
# 
# # Plot cross-validation results
# plot(ridge_cv)
#   
# # Best cross-validated lambda
# lambda_cv <- ridge_cv$lambda.min

Cross-validation is a statistical method for evaluating and comparing learning algorithms. It divides the data into two segments: one for training a model and the other for validating it.

Cross validation involves the following steps:

  1. Set aside a sample of the data set as the validation set.
  2. Use the rest of the data set to train the model.
  3. Evaluate the model on the reserved validation sample to measure how well it performs. If the model performs well on the validation data, proceed with it.
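
The steps above can be sketched as a k-fold loop in base R. This is a minimal illustration on the built-in mtcars data, not the car data; the model formula and the choice of 5 folds are assumptions made for this example. Each fold is held out once for validation while the model is trained on the remaining folds.

```r
# mtcars stands in for the car data; 5 folds is an arbitrary choice.
set.seed(1)
k     <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # random fold labels

fold_mse <- sapply(1:k, function(i) {
  train_i <- mtcars[folds != i, ]   # step 2: train on the rest
  valid_i <- mtcars[folds == i, ]   # step 1: held-out validation sample
  fit_i   <- lm(mpg ~ wt + hp, data = train_i)
  # step 3: validation error of the model trained without this fold
  mean((valid_i$mpg - predict(fit_i, newdata = valid_i))^2)
})

cv_mse <- mean(fold_mse)  # average validation MSE across the folds
print(cv_mse)
```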

The final model was then used to predict on the test data. R-squared checks how well the model fits the observations, and the MSE checks how close the estimates are to the actual values.

# lambda value found by cross-validation
lambda_cv <- 0.581
# Fit final model, get its sum of squared
# residuals and multiple R-squared
print("The value of lambda for the lowest MSE is ")
[1] "The value of lambda for the lowest MSE is "
print(lambda_cv)
[1] 0.581
model_cv <- glmnet(x_train,y_train, alpha = 0, lambda = lambda_cv,  standardize = TRUE)
y_hat_cv <- predict(model_cv, x_test)
ssr_cv <- t(y_test - y_hat_cv) %*% (y_test - y_hat_cv)

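# R-squared (squared correlation between observed and predicted values)
rsq_ridge_cv <- cor(y_test, y_hat_cv)^2
rsq_ridge_cv
        s0
[1,] 0.942
# MSE on test; its square root (RMSE) is printed here
mse0 <- mean((y_test - y_hat_cv) ^ 2)
sqrt(mse0)
[1] 1.56
print("R-squared")
[1] "R-squared"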
print(rsq_ridge_cv)
        s0
[1,] 0.942
print("MSE on test")
[1] "MSE on test"
print(mse0)
[1] 2.43
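
The two metrics can be checked on a tiny hand-made example. The vectors y_obs and y_pred below are made up for illustration. The sketch also shows one caveat of using the squared correlation as R-squared: it ignores a constant bias in the predictions, whereas the definition based on sums of squares (1 - SSE/SST) does not.

```r
# Made-up values: every prediction is off by exactly 1.
y_obs  <- c(3.0, 5.0, 7.0, 9.0)
y_pred <- c(2.0, 4.0, 6.0, 8.0)

mse  <- mean((y_obs - y_pred)^2)   # mean squared error
rmse <- sqrt(mse)                  # same units as the response

# Squared correlation: insensitive to the constant offset.
r2_cor <- cor(y_obs, y_pred)^2

# 1 - SSE/SST: penalizes the offset.
r2_sse <- 1 - sum((y_obs - y_pred)^2) / sum((y_obs - mean(y_obs))^2)

print(c(mse = mse, rmse = rmse, r2_cor = r2_cor, r2_sse = r2_sse))
```

Here the squared correlation is 1 even though every prediction is wrong, while the sums-of-squares version comes out at 0.8, so reporting the MSE alongside R-squared, as done above, is a useful safeguard.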

LASSO

Lasso regression is a shrinkage-based form of linear regression. Its L1 penalty shrinks the coefficient estimates toward zero and can set some of them exactly to zero, so the lasso encourages simple, sparse models (i.e. models with fewer parameters). The purpose of lasso regression is to find the subset of predictors that produces the least prediction error for a quantitative response variable.
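
The difference between the two penalties can be sketched for the simplified case of orthonormal predictors, which is an assumption made only for this example. In that case, for a suitably scaled penalty, the lasso solution is the soft-thresholded least-squares coefficient, while ridge rescales every coefficient by the same factor. The function soft_threshold and the coefficient values are hypothetical.

```r
# Hypothetical helper: soft-thresholding operator used by the lasso
# when the predictors are orthonormal.
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Made-up least-squares coefficients.
ols_coef <- c(3.0, -1.5, 0.4, -0.2)

lasso_coef   <- soft_threshold(ols_coef, lambda = 0.5)  # small ones become 0
ridge_shrunk <- ols_coef / (1 + 0.5)                    # all shrink, none are 0

print(rbind(ols = ols_coef, lasso = lasso_coef, ridge = ridge_shrunk))
```

The lasso drops the two smallest coefficients entirely, which is why it can act as a feature selector, whereas ridge keeps every predictor in the model.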

Build the lasso model on the training data over a grid of lambda values.

grid=10^seq(10,-2,length=100)
lasso.mod=glmnet(x_train,y_train,alpha=1,lambda=grid)  # the lambda sequence is passed via the lambda argument
plot(lasso.mod)

# # Center y, X will be standardized in the modelling function
# 
# # lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
#   
# # Perform 10-fold cross-validation to select lambda 
# # Setting alpha = 1 implements lasso regression
# lasso_cv <- cv.glmnet(x_train, y_train, alpha = 1,  nfolds = 10)
#   
# # Plot cross-validation results
# plot(lasso_cv)
#   
# # Best cross-validated lambda
# lambda_cv <- lasso_cv$lambda.min
# print("Best lambda")
# print(lambda_cv)

Fit the final lasso model with the cross-validated lambda and evaluate it on the test data.

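# lambda value found by cross-validation
lambda_cv <- 0.00374
# Fit final model, get its sum of squared residuals and R-squared
# Note: this fit uses the full data (x, y), not only x_train and y_train,
# so the test-set metrics below may be optimistic
model_cv_lasso <- glmnet(x, y, alpha = 1, lambda = lambda_cv, standardize = TRUE)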
y_hat_cv <- predict(model_cv_lasso, x_test)
ssr_cv <- t(y_test - y_hat_cv) %*% (y_test - y_hat_cv)
rsq_lasso_cv <- cor(y_test, y_hat_cv)^2

plot(model_cv_lasso, xvar = "lambda")

print(ssr_cv)
      s0
s0 18907
print(rsq_lasso_cv)
        s0
[1,] 0.952

Model Selection

Use the selected model to predict the price of a car with a chosen set of features.

Aasish_Wanted = data.frame(manufacturer_name ='Audi', model_name='A6', transmission='automatic', color='white', odometer_value = 4600, year_produced = 2015, engine_fuel='gasoline', engine_has_gas='False', engine_type = 'gasoline', engine_capacity=3.0, body_type = 'universal', has_warranty='False', state= 'owned', drivetrain='all', is_exchangeable='False', number_of_photos = 10, up_counter = 14)

# Note: despite the variable name, this prediction comes from the linear model fit3
pls2.pred = predict(fit3, Aasish_Wanted, type='response')
pls2.pred
    1 
19292 
# plot(testY, pls2.pred, ylim=c(-11,2), xlim=c(-11,2),main="Test Dataset", xlab="observed", ylab="PLS Predicted")
# abline(0, 1, col="red")
# pls.eval=data.frame(obs=solTestY, pred=pls.pred2[,1,1])
# defaultSummary(pls.eval)